ABSTRACT

The relative ease of collaborative data science and analysis has led
to a proliferation of many thousands or millions of versions of the
same datasets in many scientific and commercial domains, acquired
or constructed at various stages of data analysis across many users,
and often over long periods of time. Managing, storing, and
recreating these dataset versions is a non-trivial task. The fundamental
challenge here is the storage-recreation trade-off: the more storage
we use, the faster it is to recreate or retrieve versions, while the
less storage we use, the slower it is to recreate or retrieve versions.
Despite the fundamental nature of this problem, there has been
surprisingly little work on it. In this paper, we study
this trade-off in a principled manner: we formulate six problems
under various settings, trading off these quantities in various ways,
demonstrate that most of the problems are intractable, and propose
a suite of inexpensive heuristics, drawing on techniques from the
delay-constrained scheduling and spanning tree literature, to solve
these problems. We have built a prototype version management system
that aims to serve as a foundation for our DataHub system for
facilitating collaborative data science. We demonstrate, via
extensive experiments, that our proposed heuristics provide efficient
solutions in practical dataset versioning scenarios.

1. INTRODUCTION 

The massive quantities of data being generated every day, and 
the ease of collaborative data analysis and data science have led to 
severe issues in management and retrieval of datasets. We motivate 
our work with two concrete example scenarios. 

• [Intermediate Result Datasets] For most organizations dealing
with large volumes of diverse datasets, a common scenario
is that many datasets are repeatedly analyzed in slightly
different ways, with the intermediate results stored for future
use. Often, we find that the intermediate results are the
same across many pipelines (e.g., a PageRank computation
on the Web graph is often part of a multi-step workflow).
Oftentimes, the datasets being analyzed might be slightly
different (e.g., results of simple transformations or cleaning
operations, or small updates), but are still stored in their
entirety. There is currently no way of reducing the amount of
stored data in such a scenario: there is massive redundancy
and duplication (this was corroborated by our discussions
with a large software company), and often the computation
required to recompute a given version from another one is
small enough to not merit storing a new version.

• [Data Science Dataset Versions] In our conversations with
a computational biology group, we found that every time a
data scientist wishes to work on a dataset, they make a private
copy, perform modifications via cleansing, normalization,
adding new fields or rows, and then store these modified
versions back to a folder shared across the entire group. Once
again there is massive redundancy and duplication across
these copies, and there is a need to minimize these storage
costs while keeping these versions easily retrievable.

In such scenarios and many others, it is essential to keep track of
versions of datasets and be able to recreate them on demand; and
at the same time, it is essential to minimize the storage costs by
reducing redundancy and duplication. The ability to manage a large
number of datasets, their versions, and derived datasets, is a key
foundational piece of a system we are building for facilitating
collaborative data science, called DataHub. DataHub enables
users to keep track of datasets and their versions, represented in the
form of a directed version graph that encodes derivation
relationships, and to retrieve one or more of the versions for analysis.

In this paper, we focus on the problem of trading off storage costs
and recreation costs in a principled fashion. Specifically, the
problem we address in this paper is: given a collection of datasets as
well as (possibly) a directed version graph connecting them,
minimize the overall storage for storing the datasets and the recreation
costs for retrieving them. The two goals conflict with each other:
minimizing storage cost typically leads to increased recreation
costs and vice versa. We illustrate this trade-off via an example.

Example 1. Figure 1(i) displays a version graph, indicating
the derivation relationships among 5 versions. Let V_1 be the
original dataset. Say there are two teams collaborating on this dataset:
team 1 modifies V_1 to derive V_2, while team 2 modifies V_1 to derive
V_3. Then, V_2 and V_3 are merged to give V_5. As presented in
Figure 1, V_1 is associated with (10000, 10000), indicating that V_1's
storage cost and recreation cost are both 10000 when stored in its
entirety (we note that these two are typically measured in different
units; see the second challenge below); the edge (V_1 → V_3) is
annotated with (1000, 3000), where 1000 is the storage cost for V_3
when stored as the modification from V_1 (we call this the delta of
V_3 from V_1) and 3000 is the recreation cost for V_3 given V_1, i.e., the
time taken to recreate V_3 given that V_1 has already been recreated.

Figure 1: (i) A version graph over 5 datasets; annotation (a, b)
indicates a storage cost of a and a recreation cost of b; (ii, iii, iv)
three possible storage graphs

One naive solution to store these datasets would be to store all
of them in their entirety (Figure 1(ii)). In this case, each version
can be retrieved directly but the total storage cost is rather large,
i.e., 10000 + 10100 + 9700 + 9800 + 10120 = 49720. At the
other extreme, only one version is stored in its entirety while other
versions are stored as modifications or deltas to that version, as
shown in Figure 1(iii). The total storage cost here is much smaller
(10000 + 200 + 1000 + 50 + 200 = 11450), but the recreation
cost is large for V_2, V_3, V_4 and V_5. For instance, the path
(V_1 → V_3 → V_5) needs to be accessed in order to retrieve V_5 and the
recreation cost is 10000 + 3000 + 550 = 13550 > 10120.
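The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names are hypothetical, and the lists simply restate the numbers quoted in the text.

```python
# Costs quoted in Example 1 (variable names are illustrative, not from the paper).
full_storage = [10000, 10100, 9700, 9800, 10120]  # option (ii): every version stored in full
chain_storage = [10000, 200, 1000, 50, 200]       # option (iii): one full version plus deltas
path_to_v5 = [10000, 3000, 550]                   # recreation costs along V1 -> V3 -> V5

storage_ii = sum(full_storage)
storage_iii = sum(chain_storage)
recreate_v5_iii = sum(path_to_v5)

print(storage_ii, storage_iii, recreate_v5_iii)  # 49720 11450 13550
```

Option (iii) uses less than a quarter of the storage of option (ii), but retrieving V_5 costs 13550 instead of 10120.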

Figure 1(iv) shows an intermediate solution that trades off
increased storage for reduced recreation costs for some versions. Here
we store versions V_1 and V_3 in their entirety and store
modifications to other versions. This solution exhibits higher storage
cost than solution (iii) but lower than (ii), and still results in
significantly reduced retrieval costs for versions V_3 and V_5 over (iii).

Despite the fundamental nature of the storage-retrieval problem,
there is surprisingly little prior work on formally analyzing this
trade-off and on designing techniques for identifying effective
storage solutions for a given collection of datasets. Version Control
Systems (VCS) like Git, SVN, or Mercurial, despite their
popularity, use fairly simple algorithms underneath, and are known to have
significant limitations when managing large datasets. Much
of the prior work in the literature focuses on a linear chain of versions,
or on minimizing the storage cost while ignoring the recreation cost
(we discuss the related work in more detail later in the paper).

In this paper, we initiate a formal study of the problem of
deciding how to jointly store a collection of dataset versions, provided
along with a version or derivation graph. Aside from being able
to handle the scale, both in terms of dataset sizes and the number
of versions, there are several other considerations that make this
problem challenging.

• Different application scenarios and constraints lead to many
variations on the basic theme of balancing storage and
recreation cost (see Table 1). The variations arise both out of
different ways to reconcile the conflicting optimization goals,
as well as because of the variations in how the differences
between versions are stored and how versions are reconstructed.
For example, some mechanisms for constructing differences
between versions lead to symmetric differences (either
version can be recreated from the other version); we call this
the undirected case. The scenario with asymmetric, one-way
differences is referred to as the directed case.

• Similarly, the relationship between storage and recreation
costs leads to significant variations across different settings.
In some cases the recreation cost is proportional to the
storage cost (e.g., if the system bottleneck lies in the I/O cost
or network communication), but that may not be true when
the system bottleneck is CPU computation. This is especially
true for sophisticated differencing mechanisms where a
compact derivation procedure might be known to generate one
dataset from another.

• Another critical issue is that computing deltas for all pairs of
versions is typically not feasible. Relying purely on the
version graph may not be sufficient and significant redundancies
across datasets may be missed.

• Further, in many cases, we may have information about
relative access frequencies indicating the relative likelihood of
retrieving different datasets. Several baseline algorithms for
solving this problem cannot be easily adapted to incorporate
such access frequencies.

We note that the problem described thus far is inherently “online”
in that new datasets and versions are typically being created
continuously and are being added to the system. In this paper, we focus
on the static, offline version of this problem and on formally
and completely understanding that version. We plan to address the
online version of the problem in the future. The key contributions
of this work are as follows.

• We formally define and analyze the dataset versioning
problem and consider several variations of the problem that trade
off storage cost and recreation cost in different manners,
under different assumptions about the differencing mechanisms
and recreation costs (Section 2). Table 1 summarizes the
problems and our results. We show that most of the
variations of this problem are NP-Hard (Section 3).

• We provide two light-weight heuristics: one for when there is
a constraint on average recreation cost, and one for when there
is a constraint on maximum recreation cost; we also show
how we can adapt a prior solution for balancing minimum
spanning trees and shortest path trees for undirected graphs
(presented in a later section).

• We have built a prototype system where we implement the
proposed algorithms. We present an extensive experimental
evaluation of these algorithms over several synthetic and
real-world workloads demonstrating the effectiveness of our
algorithms at handling large problem sizes (presented in a later
section).

2. PROBLEM OVERVIEW 

In this section, we first introduce essential notations and then
present the various problem formulations. We then present a
mapping of the basic problem to a graph-theoretic problem, and also
describe an integer linear program to solve the problem optimally.

2.1 Essential Notations and Preliminaries

Version Graph. We let 𝒱 = {V_i}, i = 1, ..., n be a
collection of versions. The derivation relationships between versions are
represented in the form of a version graph: 𝒢(𝒱, ℰ).
A directed edge from V_i to V_j in 𝒢(𝒱, ℰ) represents that V_j was
derived from V_i (either through an update operation, or through
an explicit transformation). Since branching and merging are
permitted in DataHub (admitting collaborative data science), 𝒢 is a
DAG (directed acyclic graph) instead of a linear chain. For
example, Figure 1(i) represents a version graph 𝒢, where V_2 and V_3 are
derived from V_1 separately, and then merged to form V_5.

Table 1: Problem Variations With Different Constraints, Objectives and Scenarios.

Storage and Recreation. Given a collection of versions 𝒱, we
need to reason about the storage cost, i.e., the space required to
store the versions, and the recreation cost, i.e., the time taken to
recreate or retrieve the versions. For a version V_i, we can either:

• Store V_i in its entirety: in this case, we denote the storage
required to record version V_i fully by Δ_{i,i}. The recreation cost
in this case is the time needed to retrieve this recorded
version; we denote that by Φ_{i,i}. A version that is stored in its
entirety is said to be materialized.

• Store a “delta” from V_j: in this case, we do not store V_i fully;
we instead store its modifications from another version V_j.
For example, we could record that V_i is just V_j but with the
50th tuple deleted. We refer to the information needed to
construct version V_i from version V_j as the delta from V_j to
V_i. The algorithm giving us the delta is called a differencing
algorithm. The storage cost for recording modifications from
V_j, i.e., the size of the delta, is denoted by Δ_{j,i}. The recreation
cost is the time needed to recreate the recorded version given
that V_j has been recreated; this is denoted by Φ_{j,i}.

Thus the storage and recreation costs can be represented using two
matrices Δ and Φ: the entries along the diagonal represent the costs
for the materialized versions, while the off-diagonal entries
represent the costs for deltas. From this point forward, we focus our
attention on these matrices: they capture all the relevant
information about the versions for managing and retrieving them.

Delta Variants. Notice that by changing the differencing
algorithm, we can produce deltas of various types:

• for text files, UNIX-style diffs, i.e., line-by-line
modifications between versions, are commonly used;

• we could have a listing of a program, script, SQL query, or
command that generates version V_i from V_j;

• for some types of data, an XOR between the two versions
can be an appropriate delta; and

• for tabular data (e.g., relational tables), recording the
differences at the cell level is yet another type of delta.

Furthermore, the deltas could be stored compressed or uncompressed.
The various delta variants lead to various dimensions of the problem
that we will describe subsequently.

The reader may be wondering why we need to reason about two
matrices Δ and Φ. In some cases, the two may be proportional to
each other (e.g., if we are using uncompressed UNIX-style diffs).
But in many cases, the storage cost of a delta and the recreation cost
of applying that delta can be very different from each other,
especially if the deltas are stored in a compressed fashion. Furthermore,
while the storage cost is more straightforward to account for in that
it is proportional to the bytes required to store the deltas between
versions, recreation cost is more complicated: it could depend on
the network bandwidth (if versions or deltas are stored remotely),
the I/O bandwidth, and the computation costs (e.g., if
decompression or running of a script is needed).

Figure 2: Matrices corresponding to the example in Figure 1 (with
additional entries revealed beyond the ones given by version graph)

Example 2. Figure 2 shows the matrices Δ and Φ based on the
version graph in Figure 1. The annotation associated with the edge
(V_i, V_j) in Figure 1 is essentially (Δ_{i,j}, Φ_{i,j}), whereas the vertex
annotation for V_i is (Δ_{i,i}, Φ_{i,i}). If there is no edge from V_i to V_j
in the version graph, we have two choices: we can either mark the
corresponding Δ and Φ entries as unknown (as shown in
the figure), or we can explicitly compute the values of those entries
(by running a differencing algorithm). For instance, Δ_{3,2} = 1100
and Φ_{3,2} = 3200 are computed explicitly in the figure (the specific
numbers reported here are fictitious and not the result of running
any specific algorithm).
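One way to hold such partially revealed matrices in code is a sparse mapping from (source, target) index pairs to costs. The sketch below is illustrative only: the names `delta`, `phi` and `cost` are hypothetical, and it contains just the entries quoted in the examples above.

```python
# Sparse storage (delta) and recreation (phi) matrices.
# Key (i, j) is the delta from V_i to V_j; a diagonal key (i, i) is the cost of
# materializing V_i. Pairs that are absent have not been revealed (unknown).
delta = {(1, 1): 10000, (1, 3): 1000, (3, 2): 1100}
phi = {(1, 1): 10000, (1, 3): 3000, (3, 2): 3200}

def cost(matrix, i, j):
    """Return the entry for the delta from V_i to V_j, or None if it is unknown."""
    return matrix.get((i, j))

print(cost(delta, 1, 3), cost(phi, 1, 3))  # a revealed pair
print(cost(delta, 2, 3))                   # an unrevealed pair yields None
```

A dictionary keeps unknown entries implicit, which fits the setting where only a small fraction of the n × n pairs is ever computed.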


Discussion. Before moving on to formally defining the basic
optimization problem, we note several complications that present unique
challenges in this scenario.

• Revealing entries in the matrix: Ideally, we would like to
compute all pairwise Δ and Φ entries, so that we do not
miss any significant redundancies among versions that are
far from each other in the version graph. However, when
the number of versions, denoted n, is large, computing all
those entries can be very expensive (and typically
infeasible), since this means computing deltas between all pairs of
versions. Thus, we must reason with incomplete Δ and Φ
matrices. Given a version graph 𝒢, one option is to restrict
our deltas to correspond to actual edges in the version graph;
another option is to restrict our deltas to be between “close
by” versions, with the understanding that versions close to
each other in the version graph are more likely to be
similar. Prior work has also suggested mechanisms (e.g., based
on hashing) to find versions that are close to each other.
We assume that some mechanism to choose which deltas to
reveal is provided to us.

• Multiple “delta” mechanisms: Given a pair of versions (V_i, V_j),
there could be many ways of maintaining a delta between
them, with different Δ_{i,j}, Φ_{i,j} costs. For example, we can
store a program used to derive V_j from V_i, which could take
longer to run (i.e., the recreation cost is higher) but is more
compact (i.e., storage cost is lower), or explicitly store the
UNIX-style diffs between the two versions, with lower
recreation costs but higher storage costs. For simplicity, we pick
one delta mechanism: thus the matrices Δ, Φ just have one
entry per (i, j) pair. Our techniques also apply to the more
general scenario with small modifications.

• Branches: Both branching and merging are common in
collaborative analysis, making the version graph a directed acyclic
graph. In this paper, we assume each version is either stored
in its entirety or stored as a delta from a single other version,
even if it is derived from two different datasets. Although
it may be more efficient to allow a version to be stored as
a delta from two other versions in some cases, representing
such a storage solution requires more complex constructs, and
both the problem of finding an optimal storage solution for
a given problem instance and the problem of retrieving a specific
version become much more complicated. We plan to further study
such solutions in the future.

Matrix Properties and Problem Dimensions. The storage cost
matrix Δ may be symmetric or asymmetric depending on the
specific differencing mechanism used for constructing deltas. For
example, the XOR differencing function results in a symmetric Δ
matrix since the delta from a version V_i to V_j is identical to the delta
from V_j to V_i. UNIX-style diffs where line-by-line modifications
are listed can either be two-way (symmetric) or one-way
(asymmetric). The asymmetry may be quite large. For instance, it may
be possible to represent the delta from V_i to V_j very compactly using
a command like: delete all tuples with age > 60. However, the
reverse delta from V_j to V_i is likely to be quite large, since all the
tuples that were deleted from V_i would be a part of that delta. In
this paper, we consider both these scenarios. We refer to the
scenarios where Δ is symmetric and where Δ is asymmetric as the
undirected case and the directed case, respectively.

A second issue is the relationship between Φ and Δ. In many
scenarios, it may be reasonable to assume that Φ is proportional to
Δ. This is generally true for deltas that contain detailed line-by-line
or cell-by-cell differences. It is also true if the system bottleneck
is network communication or I/O cost. In a large number of cases,
however, it may be more appropriate to treat them as independent
quantities with no overt or known relationship. For the
proportional case, we assume that the proportionality constant is 1 (i.e.,
Φ = Δ); the problem statements, algorithms and guarantees are
unaffected by having a constant proportionality factor. The other
case is denoted by Φ ≠ Δ.

This leads us to identify three distinct cases with significantly
diverse properties: (1) Scenario 1: Undirected case, Φ = Δ; (2)
Scenario 2: Directed case, Φ = Δ; and (3) Scenario 3: Directed
case, Φ ≠ Δ.

Objective and Optimization Metrics. Given Δ, Φ, our goal is to
find a good storage solution, i.e., we need to decide which versions
to materialize and which versions to store as deltas from other
versions. Let P = {(i_1, j_1), (i_2, j_2), ...} denote a storage solution.
A pair with i_k = j_k indicates that the version V_{i_k} is materialized
(i.e., stored explicitly in its entirety), whereas a pair (i_k, j_k),
i_k ≠ j_k indicates that we store a delta from V_{i_k} to V_{j_k}.

We require any solution we consider to be a valid solution, where
it is possible to reconstruct any of the original versions. More
formally, P is considered a valid solution if and only if for every
version V_i, there exists a sequence of distinct versions
V_{l_1}, ..., V_{l_k} = V_i such that (l_1, l_1), (l_1, l_2), (l_2, l_3), ...,
(l_{k-1}, l_k) are contained in P (in other words, there is a version
V_{l_1} that can be materialized and can be used to recreate V_i
through a chain of deltas).

We can now formally define the optimization goals:

• Total Storage Cost (denoted C): The total storage cost for a
solution P is simply the storage cost necessary to store all the
materialized versions and the deltas: C = Σ_{(i,j) ∈ P} Δ_{i,j}.

• Recreation Cost for V_i (denoted R_i): Let V_{l_1}, ..., V_{l_k} = V_i
denote a sequence that can be used to reconstruct V_i. The
cost of recreating V_i using that sequence is:
Φ_{l_1,l_1} + Φ_{l_1,l_2} + ... + Φ_{l_{k-1},l_k}.
The recreation cost for V_i is the minimum of
these quantities over all sequences that can be used to
recreate V_i.
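The two metrics can be evaluated for a candidate storage solution as sketched below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the matrices are sparse dictionaries keyed by (source, target), and the minimum over recreation sequences is computed with Dijkstra's algorithm, which assumes non-negative costs.

```python
import heapq

def total_storage(solution, delta):
    """Total storage cost C: sum of the delta entries of all stored pairs."""
    return sum(delta[pair] for pair in solution)

def recreation_costs(solution, phi):
    """Minimum recreation cost per version over chains of stored deltas."""
    children = {}
    heap = []
    for i, j in solution:
        if i == j:
            # A materialized version can start a recreation sequence.
            heapq.heappush(heap, (phi[(i, i)], i))
        else:
            children.setdefault(i, []).append(j)
    best = {}
    while heap:
        c, v = heapq.heappop(heap)
        if v in best:
            continue
        best[v] = c
        for w in children.get(v, []):
            if w not in best:
                heapq.heappush(heap, (c + phi[(v, w)], w))
    return best  # versions missing from the result cannot be recreated

# Entries quoted in the running example; the solution itself is illustrative.
delta = {(1, 1): 10000, (1, 3): 1000, (3, 2): 1100}
phi = {(1, 1): 10000, (1, 3): 3000, (3, 2): 3200}
solution = [(1, 1), (1, 3), (3, 2)]
storage = total_storage(solution, delta)
recreation = recreation_costs(solution, phi)
print(storage, recreation)
```

A solution is valid for a set of versions exactly when every version appears in the returned dictionary.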

Problem Formulations. We now state the problem formulations
that we consider in this paper, starting with two base cases that
represent two extreme points in the spectrum of possible problems.

Problem 1 (Minimizing Storage). Given Δ, Φ, find a
valid solution P such that C is minimized.

Problem 2 (Minimizing Recreation). Given Δ, Φ,
identify a valid solution P such that ∀i, R_i is minimized.

The above two formulations minimize either the storage cost or
the recreation cost, without worrying about the other. It may
appear that the second formulation is not well-defined and we should
instead aim to minimize the average recreation cost across all
versions. However, the (simple) solution that minimizes average
recreation cost also naturally minimizes R_i for each version.

In the next two formulations, we want to minimize (a) the sum of
recreation costs over all versions (Σ_i R_i), or (b) the max recreation
cost across all versions (max_i R_i), under the constraint that total
storage cost C does not exceed some threshold β. These problems
are relevant when the storage budget is limited.

Problem 3 (MinSum Recreation). Given Δ, Φ and a
threshold β, identify a valid solution P such that C ≤ β, and
Σ_i R_i is minimized.

Problem 4 (MinMax Recreation). Given Δ, Φ and a
threshold β, identify a valid solution P such that C ≤ β, and
max_i R_i is minimized.

The next two formulations seek to instead minimize the total 
storage cost C given a constraint on the sum of recreation costs 
or max recreation cost. These problems are relevant when we want 
to reduce the storage cost, but must satisfy some constraints on the 
recreation costs. 

Problem 5 (Minimizing Storage (Sum Recreation)).
Given Δ, Φ and a threshold θ, identify a valid solution P such that
Σ_i R_i ≤ θ, and C is minimized.

Problem 6 (Minimizing Storage (Max Recreation)).
Given Δ, Φ and a threshold θ, identify a valid solution P such that
max_i R_i ≤ θ, and C is minimized.

2.2 Mapping to Graph Formulation

In this section, we map our problem into a graph problem,
which will help us adopt and modify algorithms from well-studied
problems such as minimum spanning tree construction and
delay-constrained scheduling. Given the matrices Δ and Φ, we can
construct a directed, edge-weighted graph G = (V, E) representing
the relationship among different versions as follows. For each
version V_i, we create a vertex V_i in G. In addition, we create a dummy
vertex V_0 in G. For each V_i, we add an edge V_0 → V_i, and assign
its edge-weight as a tuple (Δ_{i,i}, Φ_{i,i}). Next, for each Δ_{i,j} ≠ ∞,
we add an edge V_i → V_j with edge-weight (Δ_{i,j}, Φ_{i,j}).

The resulting graph G is similar to the original version graph, but
with several important differences. An edge in the version graph
indicates a derivation relationship, whereas an edge in G simply
indicates that it is possible to recreate the target version using the
source version and the associated edge delta (in fact, ideally G is a
complete graph). Unlike the version graph, G may contain cycles,
and it also contains the special dummy vertex V_0. Additionally,
in the version graph, if a version V_i has multiple in-edges, it is
the result of a user/application merging changes from multiple
versions into V_i. However, multiple in-edges in G capture the multiple
choices that we have in recreating V_i from some other versions.

Given graph G = (V, E), the goal of each of our problems is to
identify a storage graph G_s = (V_s, E_s), a subgraph of G, favorably
balancing total storage cost and the recreation cost for each
version. Implicitly, we will store all versions and deltas corresponding
to edges in this storage graph. (We explain this in the context of the
example below.) We say a storage graph G_s is feasible for a given
problem if (a) each version can be recreated based on the
information contained or stored in G_s, and (b) the recreation cost or the
total storage cost meets the constraint listed in each problem.

Example 3. Given matrices Δ and Φ in Figure 2(i) and 2(ii),
the corresponding graph G is shown in Figure 3. Every version is
reachable from V_0. For example, edge (V_0, V_1) is weighted with
(Δ_{1,1}, Φ_{1,1}) = (10000, 10000); edge (V_3, V_5) is weighted with
(Δ_{3,5}, Φ_{3,5}) = (800, 2500). Figure 4 is a feasible storage graph
given G in Figure 3, where V_1 and V_3 are materialized (since the
edges from V_0 to V_1 and V_3 are present) while V_2, V_4 and V_5 are
stored as modifications from other versions.

Figure 3: Graph G

Figure 4: Storage Graph G_s

After mapping our problem into a graph setting, we have the
following lemma.

Lemma 1. The optimal storage graph G_s = (V_s, E_s) for all 6
problems listed above must be a spanning tree T rooted at dummy
vertex V_0 in graph G.

Proof. Recall that a spanning tree of a graph G(V, E) is a
subgraph of G that (i) includes all vertices of G, (ii) is connected, i.e.,
every vertex is reachable from every other vertex, and (iii) has no
cycles. Any G_s must satisfy (i) and (ii) in order to ensure that a
version V_i can be recreated from V_0 by following the path from V_0
to V_i. Conversely, if a subgraph satisfies (i) and (ii), it is a valid
G_s according to our definition above. Regarding (iii), presence of
a cycle creates redundancy in G_s. Formally, given any subgraph
that satisfies (i) and (ii), we can arbitrarily delete one edge from
each of its cycles until the subgraph is cycle free, while preserving
(i) and (ii). □


For Problems 1 and 2, we have the following observations. A
minimum spanning tree is defined as a spanning tree of smallest
weight, where the weight of a tree is the sum of all its edge weights.
A shortest path tree is defined as a spanning tree where the path
from root to each vertex is a shortest path between those two in the
original graph: this simply consists of the edges that were
explored in an execution of Dijkstra's shortest path algorithm.

Lemma 2. The optimal storage graph G_s for Problem 1 is a
minimum spanning tree of G rooted at V_0, considering only the
weights Δ_{i,j}.

Lemma 3. The optimal storage graph G_s for Problem 2 is a
shortest path tree of G rooted at V_0, considering only the weights
Φ_{i,j}.
2.3 ILP Formulation 

We present an ILP formulation of the optimization problems
described above. Here, we take Problem 6 as an example; other
problems are similar. Let x_{i,j} be a binary variable for each edge
(V_i, V_j) ∈ E, indicating whether edge (V_i, V_j) is in the storage
graph or not. Specifically, x_{0,j} = 1 indicates that version V_j is
materialized, while x_{i,j} = 1 indicates that the modification from
version i to version j is stored, where i ≠ 0. Let r_i be a continuous
variable for each vertex V_i ∈ V, where r_0 = 0; r_i captures the
recreation cost for version i (and must be ≤ θ).

minimize Σ_{(V_i,V_j) ∈ E} x_{i,j} × Δ_{i,j}, subject to:

1. Σ_i x_{i,j} = 1, ∀j

2. r_j - r_i ≥ Φ_{i,j} if x_{i,j} = 1

3. r_i ≤ θ, ∀i

Lemma 4. Problem 6 is equivalent to the optimization problem
described above.

Note however that the general form of an ILP does not permit an
if-then statement (as in (2) above). Instead, we can transform to the
general form with the aid of a large constant G. Thus, constraint 2
can be expressed as follows:

Φ_{i,j} + r_i - r_j ≤ (1 - x_{i,j}) × G

where G is a “sufficiently large” constant such that no additional
constraint is added to the model. For instance, G here can be set as
2θ. On one hand, if x_{i,j} = 1, then Φ_{i,j} + r_i - r_j ≤ 0. On the other
hand, if x_{i,j} = 0, then Φ_{i,j} + r_i - r_j ≤ G. Since G is “sufficiently
large”, no additional constraint is added.
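The effect of the large constant can be seen by checking the linearized constraint on a concrete assignment. The sketch below is hypothetical: `violated_edges` is an illustrative helper, the dummy edges are keyed as (0, j), and the numbers (including θ = 20000) are toy values.

```python
def violated_edges(x, r, phi, big_g):
    """Edges (i, j) that violate phi[i, j] + r[i] - r[j] <= (1 - x[i, j]) * big_g."""
    return [
        (i, j)
        for (i, j), chosen in x.items()
        if phi[(i, j)] + r[i] - r[j] > (1 - chosen) * big_g
    ]

theta = 20000          # assumed recreation threshold
big_g = 2 * theta      # the "sufficiently large" constant
phi = {(0, 1): 10000, (1, 3): 3000, (0, 3): 9700}
x = {(0, 1): 1, (1, 3): 1, (0, 3): 0}   # V_1 materialized, V_3 stored as a delta from V_1

consistent = violated_edges(x, {0: 0, 1: 10000, 3: 13000}, phi, big_g)
too_small = violated_edges(x, {0: 0, 1: 10000, 3: 12000}, phi, big_g)
print(consistent, too_small)
```

For the unselected edge (0, 3) the right-hand side is the large constant, so the constraint never binds; for selected edges it forces r_j to be at least r_i plus the recreation cost of the edge.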






3. COMPUTATIONAL COMPLEXITY 

In this section, we study the complexity of the problems listed in
Table 1 under different application scenarios.

Problem 1 and 2 Complexity. As discussed in Section 2,
Problems 1 and 2 can be solved in polynomial time by directly applying
a minimum spanning tree algorithm (Kruskal's algorithm or Prim's
algorithm for undirected graphs; Edmonds' algorithm [38] for
directed graphs) and Dijkstra's shortest path algorithm, respectively.
Kruskal's algorithm has time complexity O(E log V), while Prim's
algorithm also has time complexity O(E log V) when using a binary
heap for implementing the priority queue, and O(E + V log V)
when using a Fibonacci heap for implementing the priority queue.
The running time of Edmonds' algorithm is O(EV) and can be
reduced to O(E + V log V) with a faster implementation. Similarly,
Dijkstra's algorithm for constructing the shortest path tree starting
from the root has a time complexity of O(E log V) via a binary
heap-based priority queue implementation and a time complexity
of O(E + V log V) via a Fibonacci heap-based priority queue
implementation.
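For the undirected case, the binary-heap variant of Prim's algorithm mentioned above can be sketched as follows. The function name and the toy weights are illustrative assumptions; vertex 0 plays the role of the dummy root, so the weight of an edge incident to 0 stands for a materialization cost.

```python
import heapq

def prim_mst(adjacency, root=0):
    """Prim's algorithm with a binary heap, O(E log V).

    adjacency: {u: [(v, weight), ...]} with each undirected edge listed both ways.
    Returns the tree edges as (parent, child, weight) triples.
    """
    in_tree = {root}
    tree_edges = []
    heap = [(w, root, v) for v, w in adjacency.get(root, [])]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue
        in_tree.add(v)
        tree_edges.append((u, v, w))
        for nxt, nw in adjacency.get(v, []):
            if nxt not in in_tree:
                heapq.heappush(heap, (nw, v, nxt))
    return tree_edges

# Toy undirected instance (weights are illustrative storage costs).
adjacency = {
    0: [(1, 10000), (2, 10100), (3, 9700)],
    1: [(0, 10000), (2, 200), (3, 1000)],
    2: [(0, 10100), (1, 200)],
    3: [(0, 9700), (1, 1000)],
}
tree = prim_mst(adjacency)
print(tree, sum(w for _, _, w in tree))
```

The directed case needs Edmonds' algorithm instead, which this sketch does not cover.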

Next, we show NP-hardness for the special case where Δ = Φ and
Φ is symmetric, beginning with Problem 6. This will lead to
hardness proofs for the other variants.

Triangle Inequality. The primary challenge that we encounter
while demonstrating hardness is that our deltas must obey the
triangle inequality: unlike other settings where deltas need not obey real
constraints, in our case deltas represent actual modifications
that can be stored, so they must obey additional realistic constraints.
This causes severe complications in proving hardness, often
transforming the proofs from very simple to fairly challenging.

Consider the scenario when Δ = Φ and Φ is symmetric. We take
Δ as an example. The triangle inequality can be stated as follows:

|Δ_{p,q} - Δ_{q,w}| ≤ Δ_{p,w} ≤ Δ_{p,q} + Δ_{q,w}
|Δ_{p,p} - Δ_{p,q}| ≤ Δ_{q,q} ≤ Δ_{p,p} + Δ_{p,q}

where p, q, w ∈ V and p ≠ q ≠ w. The first inequality states that
the “delta” between two versions cannot exceed the total “deltas”
of any two-hop path with the same starting and ending vertex; while
the second inequality indicates that the “delta” between two
versions must be bigger than one version's full storage cost minus
another version's full storage cost. Since each tuple and modification
is recorded explicitly when Φ is symmetric, it is natural that these
two inequalities hold.
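Both inequalities are easy to test mechanically on a small, fully revealed symmetric matrix. The sketch below is illustrative: the helper names are hypothetical and the costs are made-up toy values, not taken from the paper.

```python
from itertools import permutations

def symmetric(full_costs, pair_costs):
    """Build a symmetric matrix: full_costs on the diagonal, pair_costs both ways."""
    delta = {(p, p): c for p, c in full_costs.items()}
    for (p, q), c in pair_costs.items():
        delta[(p, q)] = delta[(q, p)] = c
    return delta

def obeys_triangle_inequality(delta, versions):
    """Check both inequalities for every ordered choice of distinct versions."""
    for p, q, w in permutations(versions, 3):
        lower = abs(delta[(p, q)] - delta[(q, w)])
        upper = delta[(p, q)] + delta[(q, w)]
        if not (lower <= delta[(p, w)] <= upper):
            return False
    for p, q in permutations(versions, 2):
        lower = abs(delta[(p, p)] - delta[(p, q)])
        upper = delta[(p, p)] + delta[(p, q)]
        if not (lower <= delta[(q, q)] <= upper):
            return False
    return True

full = {1: 100, 2: 110, 3: 105}                     # toy materialization costs
good = symmetric(full, {(1, 2): 20, (2, 3): 15, (1, 3): 30})
bad = symmetric(full, {(1, 2): 20, (2, 3): 15, (1, 3): 50})
print(obeys_triangle_inequality(good, [1, 2, 3]))
print(obeys_triangle_inequality(bad, [1, 2, 3]))
```

In the second matrix the direct delta between versions 1 and 3 (50) exceeds the two-hop total through version 2 (35), so the first inequality fails.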



Figure 5: Illustration of Proof of Lemma 5, panels (a) and (b)

Problem 6 Hardness. We now demonstrate hardness.

Lemma 5. Problem 6 is NP-hard when Δ = Φ and Φ is
symmetric.
Proof. Here we prove NP-hardness using a reduction from the
set cover problem. Recall that in the set cover problem, we are
given $m$ sets $S = \{s_1, s_2, \ldots, s_m\}$ and $n$ items $T = \{t_1, t_2, \ldots, t_n\}$,
where each set $s_i$ covers some items, and the goal is to pick $k$ sets
$\mathcal{F} \subseteq S$ such that $\bigcup_{F \in \mathcal{F}} F = T$ while minimizing $k$.

Given a set cover instance, we now construct an instance of
Problem 6 that will provide a solution to the original set cover problem.
The threshold we will use in Problem 6 will be $(\beta + 1)\alpha$, where
$\beta, \alpha$ are constants that are each greater than $2(m + n)$. (This is
just to ensure that they are "large".) We now construct the graph
$G(V, E)$ in the following way; we display the constructed graph in
Figure 5. Our vertex set $V$ is as follows:

• $\forall s_i \in S$, create a vertex $s_i$ in $V$.

• $\forall t_i \in T$, create a vertex $t_i$ in $V$.

• create an extra vertex $v_0$ and two dummy vertices $v_1, v_2$ in $V$.

We add the two dummy vertices simply to ensure that $v_0$ is
materialized, as we will see later. We now define the storage cost for
materializing each vertex in $V$ in the following way:

• $\forall s_i \in S$, the cost is $\alpha$.

• $\forall t_i \in T$, the cost is $(\beta + 1)\alpha$.

• for vertex $v_0$, the cost is $\alpha$.

• for vertices $v_1, v_2$, the cost is $(\beta + 1)\alpha$.

(These are the numbers colored blue in the tree of Figure 5(b).)
As we can see above, we have set the costs in such a way that the
vertex $v_0$ and the vertices corresponding to sets in $S$ have low
materialization cost, while the other vertices have high materialization
cost: this is by design so that we only end up materializing these
low-cost vertices. Our edge set $E$ is now as follows.

• we connect vertex $v_0$ to each $s_i$ with weight 1.

• we connect $v_0$ to both $v_1$ and $v_2$, each with weight $\beta\alpha$.

• $\forall s_i \in S$, we connect $s_i$ to $t_j$ with weight $\beta\alpha$ when $t_j \in s_i$,
where $\alpha = |V|$.

It is easy to show that our constructed graph $G$ obeys the triangle
inequality.
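The construction can be written down directly. The sketch below builds the materialization costs, the delta edges, and the recreation threshold from a set cover instance. The vertex labels (`"v0"`, `"v1"`, `"v2"`), the dictionary layout, and the default choice of $2(m+n)+1$ for both constants are assumptions for illustration; any sufficiently large constants would serve.

```python
def build_lemma5_instance(sets, items, alpha=None, beta=None):
    """Build the reduction graph from a set cover instance.

    sets: dict mapping a set name to the collection of items it covers.
    items: list of item names.
    Returns (materialize, edges, threshold), where materialize[v] is the
    cost of storing v in full and edges[(u, v)] is the delta weight.
    """
    m, n = len(sets), len(items)
    # Assumed defaults: the smallest integers greater than 2(m + n).
    if alpha is None:
        alpha = 2 * (m + n) + 1
    if beta is None:
        beta = 2 * (m + n) + 1

    # Extra vertex v0 (cheap) and two expensive dummy vertices.
    materialize = {"v0": alpha,
                   "v1": (beta + 1) * alpha,
                   "v2": (beta + 1) * alpha}
    edges = {("v0", "v1"): beta * alpha, ("v0", "v2"): beta * alpha}

    for s, covered in sets.items():
        materialize[s] = alpha          # set vertices are cheap
        edges[("v0", s)] = 1            # delta from v0 to each set
        for t in covered:
            edges[(s, t)] = beta * alpha  # delta from a set to its items

    for t in items:
        materialize[t] = (beta + 1) * alpha  # item vertices are expensive

    threshold = (beta + 1) * alpha  # bound on every recreation cost
    return materialize, edges, threshold
```

With these weights, an item reached from a materialized set has recreation cost exactly equal to the threshold, while an item reached through a set that is itself stored as a delta from the extra vertex exceeds it by one; this is the property the proof steps rely on.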

Consider a solution to Problem 6 on the constructed graph $G$.
We now demonstrate that that solution leads to a solution of the
original set cover problem. Our proof proceeds in four key steps:

Step 1: The vertex $v_0$ will be materialized, while $v_1, v_2$ will not be
materialized. Assume the contrary: say $v_0$ is not materialized in a
solution to Problem 6. Then, both $v_1$ and $v_2$ must be materialized,
because if they are not, then the recreation cost of $v_1$ and $v_2$ would
be at least $\alpha(\beta + 1) + 1$, violating the condition of Problem 6.
However, we can avoid materializing $v_1$ and $v_2$, instead keep the
delta to $v_0$ and materialize $v_0$, maintaining the recreation cost as
is while reducing the storage cost. Thus $v_0$ has to be materialized,
while $v_1, v_2$ will not be materialized. (Our reason for introducing
$v_1, v_2$ is precisely to ensure that $v_0$ is materialized so that it can
provide a basis for us to store deltas to the sets $s_i$.)

Step 2: None of the $t_i$ will be materialized. Say a given $t_i$ is
materialized in the solution to Problem 6. Then, either we have a set $s_j$
where $s_j$ is connected to $t_i$ in Figure 5(a) also materialized, or not.
Let's consider the former case. In the former case, we can avoid
materializing $t_i$, and instead add the delta from $s_j$ to $t_i$, thereby
reducing storage cost while keeping recreation cost fixed. In the
latter case, pick any $s_j$ such that $s_j$ is connected to $t_i$ and is not
materialized. Then, we must have the delta from $v_0$ to $s_j$ as part
of the solution. Here, we can replace that edge, and the materialized
$t_i$, with a materialized $s_j$ and the delta from $s_j$ to $t_i$: this would
reduce the total storage cost while keeping the recreation cost fixed.
Thus, in either case, we can improve the solution if any of the $t_i$
are materialized, rendering the statement false.

Step 3: For each $s_i$, either it is materialized, or the edge from $v_0$ to
$s_i$ will be part of the storage graph. This step is easy to see: since
none of the $t_i$ are materialized, either each $s_i$ has to be materialized,
or we must store a delta from $v_0$.

Step 4: The sets $s_i$ that are materialized correspond to a minimal
set cover of the original problem. It is easy to see that for each
$t_j$ we must have an $s_i$ such that $s_i$ covers $t_j$, and $s_i$ is
materialized, in order for the recreation cost constraint to not be violated
for $t_j$. Thus, the materialized $s_i$ must be a set cover for the
original problem. Furthermore, in order for the storage cost to be as
small as possible, as few $s_i$ as possible must be materialized (this
is the only place we can save cost). Thus, the materialized $s_i$ also
correspond to a minimal set cover for the original problem.

Thus, minimizing the total storage cost is equivalent to
minimizing $k$ in the set cover problem. □

Note that while the reduction above uses a graph with only some
edge weights (i.e., recreation costs of the deltas) known, a similar
reduction can be derived for a complete graph with all edge weights
known. Here, we simply use the shortest path in the graph
reduction above as the edge weight for the missing edges. In that case,
once again, the storage graph in the solution to Problem 6 will be
identical to the storage graph described above.

Problem 5 Hardness: We now show that Problem 5 is NP-Hard as
well. The general philosophy is similar to the proof in Lemma 5,
except that we create $c$ dummy vertices instead of the two dummy
vertices $v_1, v_2$ in Lemma 5, where $c$ is sufficiently large; this is to
once again ensure that $v_0$ is materialized.

Lemma 6. Problem 5 is NP-Hard when $\Delta = \Phi$ and $\Phi$ is
symmetric.



Figure 6: Illustration of Proof of Lemma 6


Proof. We prove NP-hardness using a reduction from the set
cover problem. Recall that in the set cover decision problem, we are
given $m$ sets $S = \{s_1, s_2, \ldots, s_m\}$ and $n$ items $T = \{t_1, t_2, \ldots, t_n\}$,
where each set $s_i$ covers some items, and given a $k$, we ask if there is
a subset $\mathcal{F} \subseteq S$ such that $\bigcup_{F \in \mathcal{F}} F = T$ and $|\mathcal{F}| \le k$.

Given a set cover instance, we now construct an instance of
Problem 5 that will provide a solution to the original set cover decision
problem. The corresponding decision problem for Problem 5 is:
given threshold $\alpha + (\beta + 1)\alpha n + k\alpha + (m - k)(\alpha + 1) + (\alpha + 1)c$
in Problem 5, is the minimum total storage cost in the constructed
graph $G$ no bigger than $\alpha + k\alpha + (m - k) + \alpha\beta n + c$?


We now construct the graph $G(V, E)$ in the following way; we
display the constructed graph in Figure 6. Our vertex set $V$ is as
follows:

• $\forall s_i \in S$, create a vertex $s_i$ in $V$.

• $\forall t_i \in T$, create a vertex $t_i$ in $V$.

• create an extra vertex $v_0$, and $c$ dummy vertices $\{v_1, v_2, \ldots, v_c\}$
in $V$.

We add the $c$ dummy vertices simply to ensure that $v_0$ is
materialized, as we will see later. We now define the storage cost for
materializing each vertex in $V$ in the following way:

• $\forall s_i \in S$, the cost is $\alpha$.

• $\forall t_i \in T$, the cost is $(\beta + 1)\alpha$.

• for vertex $v_0$, the cost is $\alpha$.

• for each vertex in $\{v_1, v_2, \ldots, v_c\}$, the cost is $\alpha + 1$.

(These are the numbers colored blue in the tree of Figure 6.) As we
can see above, we have set the costs in such a way that the vertex $v_0$
and the vertices corresponding to sets in $S$ have low materialization
cost while the vertices corresponding to $T$ have high
materialization cost: this is by design so that we only end up materializing
the low-cost vertices. Even though the cost of the dummy vertices is
close to that of $v_0, s_i$, we will show below that they will not be
materialized either. Our edge set $E$ is now as follows.

• we connect vertex $v_0$ to each $s_i$ with weight 1.

• we connect $v_0$ to $v_i$, $1 \le i \le c$, each with weight 1.

• $\forall s_i \in S$, we connect $s_i$ to $t_j$ with weight $\beta\alpha$ when $t_j \in s_i$,
where $\alpha = |V|$.

It is easy to show that our constructed graph $G$ obeys the triangle
inequality.
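To see where the two closed-form expressions in the decision problem come from, the sketch below sums storage and recreation costs vertex by vertex for the storage graph induced by a candidate cover: the extra vertex and the chosen sets are materialized, the remaining sets and all dummy vertices are stored as weight-1 deltas from the extra vertex, and every item is stored as a delta from a chosen set. The function name and argument layout are illustrative assumptions.

```python
def lemma6_costs(sets, items, cover, c, alpha, beta):
    """Return (total storage cost, total recreation cost) for a cover.

    sets: dict mapping a set name to the items it covers.
    cover: names of the sets that are materialized.
    c: number of dummy vertices attached to the extra vertex.
    """
    storage = alpha          # extra vertex v0 is materialized
    recreation = alpha
    for name in sets:
        if name in cover:
            storage += alpha             # materialized set
            recreation += alpha
        else:
            storage += 1                 # delta of weight 1 from v0
            recreation += alpha + 1
    for t in items:
        if not any(t in sets[s] for s in cover):
            raise ValueError(f"item {t!r} is not covered")
        storage += beta * alpha          # delta from a materialized set
        recreation += alpha + beta * alpha
    storage += c                         # c dummy vertices, weight 1 each
    recreation += c * (alpha + 1)
    return storage, recreation
```

For a cover of size $k$ the two sums collapse to $\alpha + k\alpha + (m - k) + \alpha\beta n + c$ and $\alpha + (\beta + 1)\alpha n + k\alpha + (m - k)(\alpha + 1) + (\alpha + 1)c$, matching the storage bound and the recreation threshold used in the proof.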

Consider a solution to Problem 5 on the constructed graph $G$.
We now demonstrate that that solution leads to a solution of the
original set cover problem. Our proof proceeds in four key steps:

Step 1: The vertex $v_0$ will be materialized, while $v_i$, $1 \le i \le c$, will
not be materialized. Let's examine the first part of this observation,
i.e., that $v_0$ will be materialized. Assume the contrary. If $v_0$ is
not materialized, then at least one $v_i$, $1 \le i \le c$, or one of the
$s_i$ must be materialized, because if not, then the recreation cost
of $\{v_1, v_2, \ldots, v_c\}$ would be at least $(\alpha + 2)c > (\alpha + 1)c +
\alpha + (\beta + 1)\alpha n + k\alpha + (m - k)(\alpha + 1)$, violating the condition
(exceeding the total recreation cost threshold) of Problem 5. However,
we can avoid materializing this $v_i$ (or $s_i$), instead keep the delta
from $v_i$ (or $s_i$) to $v_0$ and materialize $v_0$, reducing the recreation cost
and the storage cost. Thus $v_0$ has to be materialized. Furthermore,
since $v_0$ is materialized, $v_i$, $1 \le i \le c$, will not be materialized
and instead we will retain the delta to $v_0$, reducing the recreation
cost and the storage cost. Hence, the first step is complete.

Step 2: None of the $t_i$ will be materialized. Say a given $t_i$ is
materialized in the solution to Problem 5. Then, either we have a set $s_j$
where $s_j$ is connected to $t_i$ in Figure 5(a) also materialized, or not.
Let us consider the former case. In the former case, we can avoid
materializing $t_i$, and instead add the delta from $s_j$ to $t_i$, thereby
reducing storage cost while keeping recreation cost fixed. In the
latter case, pick any $s_j$ such that $s_j$ is connected to $t_i$ and is not
materialized. Then, we must have the delta from $v_0$ to $s_j$ as part of
the solution. Here, we can replace that edge, and the materialized
$t_i$, with a materialized $s_j$ and the delta from $s_j$ to $t_i$: this would
reduce the total storage cost while keeping the recreation cost fixed.
Thus, in either case, we can improve the solution if any of the $t_i$
are materialized, rendering the statement false.

Step 3: For each $s_i$, either it is materialized, or the edge from $v_0$ to
$s_i$ will be part of the storage graph. This step is easy to see: since
none of the $t_i$ are materialized, either each $s_i$ has to be materialized,
or we must store a delta from $v_0$.

Step 4: If the minimum total storage cost is no bigger than
$\alpha + k\alpha + (m - k) + \alpha\beta n + c$, then there exists a subset $\mathcal{F} \subseteq S$
such that $\bigcup_{F \in \mathcal{F}} F = T$ and $|\mathcal{F}| \le k$ in the original set cover
decision problem, and vice versa. Let's examine the first part. If
the minimum total storage cost is no bigger than $\alpha + k\alpha + (m -
k) + \alpha\beta n + c$, then the storage cost for all $s_i \in S$ must be no bigger
than $k\alpha + (m - k)$, since the storage cost for $v_0$, $\{v_1, v_2, \ldots, v_c\}$
and $\{t_1, t_2, \ldots, t_n\}$ is $\alpha$, $c$ and $\alpha\beta n$ respectively according to Steps
1 and 2. This indicates that at most $k$ of the $s_i \in S$ are materialized (we let
the set of materialized $s_i$ be $M$, with $|M| \le k$). Next, we prove that
each $t_j$ is stored as the modification from a materialized $s_i \in
M$. Suppose there exists one or more $t_j$ which is stored as the
modification from $s_i \in S - M$; then the total recreation cost must
be at least $\alpha + ((\beta + 1)\alpha n + 1) + k\alpha + (m - k)(\alpha + 1) + (\alpha + 1)c$,
which exceeds the total recreation threshold. Thus, each
$t_j \in T$ is stored as the modification from some $s_i \in M$. Letting $\mathcal{F} = M$,
we obtain $\bigcup_{F \in \mathcal{F}} F = T$ and $|\mathcal{F}| \le k$. Thus, if the minimum
total storage cost is no bigger than $\alpha + k\alpha + (m - k) + \alpha\beta n + c$,
then there exists a subset $\mathcal{F} \subseteq S$ such that $\bigcup_{F \in \mathcal{F}} F = T$ and
$|\mathcal{F}| \le k$ in the original set cover decision problem.

Next let's examine the second part. If there exists a subset $\mathcal{F} \subseteq
S$ such that $\bigcup_{F \in \mathcal{F}} F = T$ and $|\mathcal{F}| \le k$ in the original set cover
decision problem, then we can materialize each vertex $s_i \in \mathcal{F}$ as
well as the extra vertex $v_0$, connect $v_0$ to $\{v_1, v_2, \ldots, v_c\}$ as well
as to each $s_j \in S - \mathcal{F}$, and connect each $t_j$ to one $s_i \in \mathcal{F}$. The resulting total
storage is $\alpha + k\alpha + (m - k) + \alpha\beta n + c$ and the total recreation
cost equals the threshold. Thus, if there exists a subset $\mathcal{F} \subseteq S$
such that $\bigcup_{F \in \mathcal{F}} F = T$ and $|\mathcal{F}| \le k$ in the original set cover
decision problem, then the minimum total storage cost is no bigger
than $\alpha + k\alpha + (m - k) + \alpha\beta n + c$.

Thus, the decision problem in Problem 5 is equivalent to the
decision problem in the set cover problem. □

Once again, the problem is still hard if we use a complete graph as 
opposed to a graph where only some edge weights are known. 

Since Problem 4 swaps the constraint and goal compared to
Problem 6, it is similarly NP-Hard. (Note that the decision versions of
the two problems are in fact identical, and therefore the proof still
applies.) Similarly, Problem 3 is also NP-Hard. Now that we have
proved NP-hardness even in the special case where $\Delta = \Phi$ and $\Phi$
is symmetric, we can conclude that Problems 3, 4, 5, and 6 are NP-hard
in a more general setting where $\Phi$ is not symmetric and $\Delta \ne \Phi$, as
listed in Table 1.

Hop-Based Variants. So far, our focus has been on proving
hardness for the special case where $\Delta = \Phi$ and $\Delta$ is undirected. We
now consider a different kind of special case, where the recreation
cost of all pairs is the same, i.e., $\Phi_{ij} = 1$ for all $i, j$, while $\Delta \ne \Phi$,
and $\Delta$ is undirected. In this case, we call the recreation cost the
hop cost, since it is simply the minimum number of delta operations
(or "hops") needed to reconstruct $V_i$.

The reason why we bring up this variant is that it directly
corresponds to a special case of the well-studied d-MinimumSteinerTree
problem: Given an undirected graph $G = (V, E)$ and a subset
$\omega \subseteq V$, find a tree with minimum weight, spanning the entire
vertex subset $\omega$ while the diameter is bounded by $d$. The special case
of the d-MinimumSteinerTree problem when $\omega = V$, i.e., the
minimum spanning tree problem with bounded diameter, directly
corresponds to Problem 6 for the hop cost variant we described above.
The hardness for this special case was demonstrated by [25] using
a reduction from the SAT problem:

Lemma 7. Problem 6 is NP-Hard when $\Delta \ne \Phi$ and $\Delta$ is
symmetric, and $\Phi_{ij} = 1$ for all $i, j$.

Note that this proof crucially uses the fact that $\Delta \ne \Phi$, unlike
Lemmas 5 and 6; thus the proofs are incomparable (i.e., one does
not subsume the other).

For the hop-based variant, additional results on hardness of ap¬ 
proximation are known by way of the d-MinimumSteinerTree prob¬ 
lem |T^[T8l|^: 

Lemma 8 ([25]). For any $\epsilon > 0$, Problem 6 has no $\ln n - \epsilon$
approximation unless $NP \subseteq DTIME(n^{\log \log n})$.

Since the hop-based variant is a special case of the last column
of Table 1, this indicates that Problem 6 for the most general case
is similarly hard to approximate; we suspect similar results hold
for the other problems as well. It remains to be seen if hardness of
approximation can be demonstrated for the variants in the second
and third last columns.

4. PROPOSED ALGORITHMS 

As discussed in Section 2, our different application scenarios
lead to different problem formulations, spanning different constraints
and objectives, and different assumptions about the nature of $\Phi$, $\Delta$.

Given that we demonstrated in the previous section that all the
problems are NP-Hard, we focus on developing efficient heuristics.
In this section, we present two novel heuristics: first, in Section 4.1,
we present LMG, or the Local Move Greedy algorithm, tailored to
the case when there is a bound or objective on the average
recreation cost; thus, this applies to Problems 3 and 5. Second, in
Section 4.2, we present MP, or Modified Prim's algorithm, tailored to
the case when there is a bound or objective on the maximum
recreation cost; thus, this applies to Problems 4 and 6. We present two
variants of the MP algorithm tailored to two different settings.

Then, we present two further algorithms: in Section 4.3, we present
an approximation algorithm called LAST, and in Section 4.4, we
present an algorithm called GitH, which is based on Git repack.
Both of these are adapted from the literature to fit our problems, and
we compare these against our algorithms in Section 5. Note that
LAST does not explicitly optimize any objectives or constraints in
the manner of LMG, MP, or GitH, and thus the four algorithms
are applicable under different settings; LMG and MP are
applicable when there is a bound or constraint on the average or
maximum recreation cost, while LAST and GitH are applicable when a
"good enough" solution is needed. Furthermore, note that all these
algorithms apply to both directed and undirected versions of the
problems, and to the symmetric and unsymmetric cases.

4.1 Local Move Greedy Algorithm 

The LMG algorithm is applicable when we have a bound or
constraint on the average case recreation cost. We focus on the case
where there is a constraint on the storage cost (Problem 3); the case
when there is no such constraint (Problem 5) can be solved by
repeated iterations and binary search on the previous problem.

Outline. At a high level, the algorithm starts with the Minimum
Spanning Tree (MST) as $G_S$, and then greedily adds edges from
the Shortest Path Tree (SPT) that are not present in $G_S$, while $G_S$
respects the bound on storage cost.




Figure 7: Illustration of Local Move Greedy Heuristic 


Detailed Algorithm. The algorithm starts off with $G_S$ equal to
the MST. The SPT naturally contains all the edges corresponding
to complete versions. The basic idea of the algorithm is to replace
deltas in $G_S$ with versions from the SPT that maximize the
following ratio:

$$\rho = \frac{\text{reduction in sum of recreation costs}}{\text{increase in storage cost}}$$

This is simply the reduction in total recreation cost per unit addition
of weight to the storage graph $G_S$.

Let $\xi$ consist of the edges in the SPT not present in $G_S$ (these
precisely correspond to the versions that are not explicitly stored in
the MST, and are instead computed via deltas in the MST). At each
"round", we pick the edge $e_{uv} \in \xi$ that maximizes $\rho$, and use it to replace
the previous edge $e_{u'v}$ into $v$. The reduction in the sum of the recreation
costs is computed by adding up the reductions in recreation costs
of all $w \in G_S$ that are descendants of $v$ in the storage graph
(including $v$ itself). On the other hand, the increase in storage cost is
simply the weight of $e_{uv}$ minus the weight of $e_{u'v}$. This process is
repeated as long as the storage budget is not violated. We explain
this by means of an example.

Example 4. Figure 7(a) denotes the current $G_S$. Node 0
corresponds to the dummy node. Now, we are considering replacing
edge $e_{14}$ with edge $e_{04}$, that is, we are replacing a delta to
version 4 with version 4 itself. Then, the denominator of $\rho$ is
simply $\Delta_{04} - \Delta_{14}$. The numerator is the change in recreation
costs of versions 4, 5, and 6 (notice that 5 and 6 were below 4
in the tree). This is actually simple to compute: it is simply three
times the change in the recreation cost of version 4 (since it affects
all three versions equally). Thus, the numerator of $\rho$ is simply
$3 \times (\Phi_{01} + \Phi_{14} - \Phi_{04})$.

Complexity. For a given round, computing $\rho$ for a given edge
is $O(|V|)$. This leads to an overall $O(|V|^3)$ complexity, since we
have up to $|V|$ rounds, and up to $|V|$ edges in $\xi$. However, if we
are smart about this computation (by precomputing and
maintaining across all rounds the number of nodes "below" every node),
we can reduce the complexity of computing $\rho$ for a given edge to
$O(1)$. This leads to an overall complexity of $O(|V|^2)$. Algorithm 1
provides pseudocode for the described technique.

Access Frequencies. Note that the algorithm can easily take into
account access frequencies of different versions and instead
optimize for the total weighted recreation cost (weighted by access
frequencies). The algorithm is similar, except that the numerator of $\rho$
will capture the reduction in weighted recreation cost.
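The greedy loop described in this section can be sketched in a few lines. This is an illustrative reading rather than the prototype's implementation: the matrix layout (`delta[u][v]` and `phi[u][v]` for storage and recreation cost of the edge from `u` to `v`, with vertex 0 as the dummy root), the per-candidate budget check, and the guard against attaching a vertex below its own descendant are all assumptions. Subtree sizes are recomputed every round for clarity, so this sketch does not achieve the tighter complexity bound discussed above.

```python
def lmg(n, delta, phi, mst_parent, spt_parent, budget):
    """Local move greedy sketch. Vertices are 0..n; 0 is the dummy root.

    mst_parent / spt_parent: dict v -> parent for v in 1..n.
    Returns (parent map of the final storage tree, its storage cost).
    """
    parent = dict(mst_parent)
    # xi: versions whose SPT edge is not yet in the storage tree.
    xi = {v for v in range(1, n + 1) if spt_parent[v] != parent[v]}

    def analyse():
        children = {v: [] for v in range(n + 1)}
        for v, u in parent.items():
            children[u].append(v)
        d, order, stack = {0: 0}, [], [0]
        while stack:                      # preorder walk from the root
            u = stack.pop()
            order.append(u)
            for v in children[u]:
                d[v] = d[u] + phi[u][v]
                stack.append(v)
        size = {v: 1 for v in range(n + 1)}
        for v in reversed(order):         # descendants before ancestors
            if v != 0:
                size[parent[v]] += size[v]
        return d, size, children

    def in_subtree(top, x, children):
        stack = [top]
        while stack:
            u = stack.pop()
            if u == x:
                return True
            stack.extend(children[u])
        return False

    storage = sum(delta[u][v] for v, u in parent.items())
    while xi:
        d, size, children = analyse()
        best = None
        for v in xi:
            u = spt_parent[v]
            # Every node in v's subtree gets cheaper by the same amount.
            gain = (d[v] - (d[u] + phi[u][v])) * size[v]
            extra = delta[u][v] - delta[parent[v]][v]
            if gain <= 0 or storage + extra > budget or in_subtree(v, u, children):
                continue
            rho = gain / extra if extra > 0 else float("inf")
            if best is None or rho > best[0]:
                best = (rho, v, extra)
        if best is None:
            break                         # no affordable improving move
        _, v, extra = best
        parent[v] = spt_parent[v]
        storage += extra
        xi.discard(v)
    return parent, storage
```

Supporting access frequencies would only change the `gain` line: instead of multiplying by the subtree size, one would multiply by the summed access frequency of the subtree.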

4.2 Modified Prim’s Algorithm 

Next, we introduce a heuristic algorithm based on Prim's
algorithm for Minimum Spanning Trees for Problem 6, where the goal
is to reduce total storage cost while the recreation cost for each version
is within threshold $\theta$; the solution for Problem 4 is similar.

Algorithm 1: Local Move Greedy Heuristic
Input : Minimum Spanning Tree (MST), Shortest Path Tree
        (SPT), source vertex $V_0$, space budget $W$
Output: A tree $T$ with weight $\le W$ rooted at $V_0$ with minimal
        sum of access cost
Initialize $T$ as MST.
Let $d(V_i)$ be the distance from $V_0$ to $V_i$ in $T$, and $p(V_i)$ denote
the parent of $V_i$ in $T$. Let $W(T)$ denote the storage cost of $T$.
while $W(T) < W$ do
    $(\rho_{max}, \bar{e}) \leftarrow (0, \emptyset)$
    foreach $e_{uv} \in \xi$ do
        compute $\rho_e$
        if $\rho_e > \rho_{max}$ then
            $(\rho_{max}, \bar{e}) \leftarrow (\rho_e, e_{uv})$
        end
    end
    $T \leftarrow T \setminus e_{u'v} \cup e_{uv}$; $\xi \leftarrow \xi \setminus e_{uv}$
    if $\xi = \emptyset$ then
        return $T$
    end
end

Outline. At a high level, the algorithm is a variant of Prim's
algorithm, greedily adding the version with the smallest storage cost and
the corresponding edge to form a spanning tree $T$. Unlike Prim's
algorithm, where the spanning tree simply grows, in this case, even
if an edge is present in $T$, it could be removed in future iterations.
At all stages, the algorithm maintains the invariant that the
recreation cost of all versions in $T$ is bounded by $\theta$.

Detailed Algorithm. At each iteration, the algorithm picks the
version $V_i$ with the smallest storage cost to be added to the tree. Once
this version $V_i$ is added, we consider adding all deltas to all other
versions $V_j$ such that their recreation cost through $V_i$ is within the
constraint $\theta$, and the storage cost does not increase. Each version
maintains a pair $l(V_i)$ and $d(V_i)$: $l(V_i)$ denotes the marginal
storage cost of $V_i$, while $d(V_i)$ denotes the total recreation cost of $V_i$.
At the start, $l(V_i)$ is simply the storage cost of $V_i$ in its entirety.

We now describe the algorithm in detail. Set $X$ represents the
current version set of the current spanning tree $T$. Initially $X =
\emptyset$. In each iteration, the version $V_i$ with the smallest storage cost
($l(V_i)$) in the priority queue $PQ$ is picked and added into the spanning
tree $T$ (lines 7-8). When $V_i$ is added into $T$, we need to update the
storage cost and recreation cost for all $V_j$ that are neighbors of $V_i$.
Notice that in Prim's algorithm, we do not need to consider
neighbors that are already in $T$. However, in our scenario a better path to
such a neighbor may be found and this may result in an update (lines
10-17). For instance, if edge $(V_i, V_j)$ can make $V_j$'s storage cost
smaller while the recreation cost for $V_j$ does not increase, we can
update $p(V_j) = V_i$ as well as $d(V_j)$, $l(V_j)$ and $T$. For neighbors
$V_j \notin T$ (lines 19-24), we update $d(V_j)$, $l(V_j)$, $p(V_j)$ if edge $(V_i, V_j)$
can make $V_j$'s storage cost smaller and the recreation cost for $V_j$
is no bigger than $\theta$. Algorithm 2 terminates in $|V|$ iterations since
one version is added into $X$ in each iteration.

Figure 8: Directed Graph G
Figure 9: Undirected Graph G
Figure 10: Illustration of Modified Prim's algorithm in Figure 8

Example 5. Say we operate on $G$ given by Figure 8, and let
the threshold $\theta$ be 6. Each version $V_i$ is associated with a pair
$\langle l(V_i), d(V_i) \rangle$. Initially version $V_0$ is pushed into the priority queue.
When $V_0$ is dequeued, each neighbor $V_j$ updates $\langle l(V_j), d(V_j) \rangle$
as shown in Figure 10(a). Notice that $l(V_i)$ for all $i$ is
simply the storage cost for that version. For example, when
considering edge $(V_0, V_1)$, $l(V_1) = 3$ and $d(V_1) = 3$ is updated since the
recreation cost (if $V_1$ is to be stored in its entirety) is smaller than the
threshold $\theta$, i.e., $3 < 6$. Afterwards, versions $V_1$, $V_2$ and $V_3$ are
inserted into the priority queue. Next, we dequeue $V_1$ since $l(V_1)$
is smallest among the versions in the priority queue, and add $V_1$
to the spanning tree. We then update $\langle l(V_j), d(V_j) \rangle$ for all
neighbors of $V_1$, e.g., the recreation cost for version $V_2$ will be
6 and the storage cost will be 2 when considering edge $(V_1, V_2)$.
Since $6 \le 6$, $(l(V_2), d(V_2))$ is updated to $(2, 6)$ as shown in Figure
10(b); however, $\langle l(V_3), d(V_3) \rangle$ will not be updated since the
recreation cost is $3 + 4 > 6$ when considering edge $(V_1, V_3)$.
Subsequently, version $V_2$ is dequeued because it has the lowest $l(V_2)$,
and is added to the tree, giving Figure 10(b). Subsequently, version
$V_3$ is dequeued. When $V_3$ is dequeued from $PQ$, $(l(V_2), d(V_2))$ is
updated. This is because the storage cost for $V_2$ can be updated to
1 and the recreation cost is still 6 when considering edge $(V_3, V_2)$,
even if $V_2$ is already in $T$, as shown in Figure 10(c). Eventually, we
get the final answer in Figure 10(d).


Complexity. The complexity of the algorithm is the same as that of
Prim's algorithm, i.e., $O(|E| \log |V|)$. Each edge is scanned once
and the priority queue needs to be updated once in the worst case.
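A compact way to read the Modified Prim's procedure is as Prim's algorithm with a recreation-cost guard and the ability to re-parent versions already in the tree. The sketch below follows that reading; the adjacency format (`adj[i]` holding `(j, storage, recreation)` triples, with the root's edges standing for full materialization), the lazy-deletion heap, and the function name are assumptions for illustration. The small graph used afterwards is hypothetical and is not the graph from the paper's figures.

```python
import heapq
import math

def modified_prim(adj, theta, root=0):
    """Return (parent, l, d) for a storage tree with d[v] <= theta.

    adj: dict vertex -> list of (neighbor, storage cost, recreation cost).
    l[v]: marginal storage cost of v; d[v]: recreation cost of v.
    """
    l = {v: math.inf for v in adj}
    d = {v: math.inf for v in adj}
    p = {v: None for v in adj}
    l[root] = d[root] = 0
    in_tree = set()
    pq = [(0, root)]                     # ordered by storage cost l
    while pq:
        cost, i = heapq.heappop(pq)
        if i in in_tree or cost != l[i]:
            continue                     # stale queue entry
        in_tree.add(i)
        for j, delta_ij, phi_ij in adj[i]:
            new_d = d[i] + phi_ij
            if j in in_tree:
                # Re-parent a tree vertex only if storage strictly drops
                # and its recreation cost does not increase.
                if new_d <= d[j] and delta_ij < l[j]:
                    p[j], d[j], l[j] = i, new_d, delta_ij
            elif new_d <= theta and delta_ij < l[j]:
                p[j], d[j], l[j] = i, new_d, delta_ij
                heapq.heappush(pq, (delta_ij, j))
    return p, l, d
```

On a hypothetical four-vertex graph with `adj = {0: [(1, 3, 3), (2, 5, 5), (3, 4, 4)], 1: [(2, 2, 3), (3, 1, 4)], 2: [], 3: [(2, 1, 2)]}` and a threshold of 6, version 2 is first attached below version 1 and later re-parented below version 3 once that cheaper delta becomes available, while the delta from version 1 to version 3 is rejected because it would push the recreation cost to 7.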

4.3 LAST Algorithm 

Here, we sketch an algorithm from previous work [22] that
enables us to find a tree with a good balance of storage and recreation
costs, under the assumptions that $\Delta = \Phi$ and $\Phi$ is symmetric.

Outline. The algorithm starts from a minimum spanning tree and
does a depth-first traversal (DFS) over the minimum spanning tree.
During the process of DFS, if the recreation cost for a node exceeds
the pre-defined threshold (set up front), then the current path is
replaced with the shortest path to the node.

Algorithm 2: Modified Prim's Algorithm
Input : Graph $G = (V, E)$, threshold $\theta$
Output: Spanning Tree $T = (V_T, E_T)$
 1 Let $X$ be the version set of current spanning tree $T$; Initially $T = \emptyset$, $X = \emptyset$;
 2 Let $p(V_i)$ be the parent of $V_i$; $l(V_i)$ denote the storage cost from $p(V_i)$ to $V_i$, $d(V_i)$ denote the recreation cost from root $V_0$ to version $V_i$;
 3 Initially $d(V_0) = l(V_0) = 0$ and, $\forall i \ne 0$, $d(V_i) = l(V_i) = \infty$;
 4 Enqueue $\langle V_0, (l(V_0), d(V_0)) \rangle$ into priority queue $PQ$;
 5 ($PQ$ is sorted by $l(V_i)$);
 6 while $PQ \ne \emptyset$ do
 7     $\langle V_i, (l(V_i), d(V_i)) \rangle \leftarrow$ top($PQ$), dequeue($PQ$);
 8     $T = T \cup \langle V_i, p(V_i) \rangle$, $X = X \cup V_i$;
 9     for $V_j \in$ ($V_i$'s neighbors in $G$) do
10         if $V_j \in X$ then
11             if $(\Phi_{i,j} + d(V_i)) \le d(V_j)$ and $\Delta_{i,j} < l(V_j)$ then
12                 $T = T - \langle V_j, p(V_j) \rangle$;
13                 $p(V_j) = V_i$;
14                 $T = T \cup \langle V_j, p(V_j) \rangle$; $d(V_j) \leftarrow \Phi_{i,j} + d(V_i)$;
15                 $l(V_j) \leftarrow \Delta_{i,j}$;
16             end
17         end
18         else
19             if $(\Phi_{i,j} + d(V_i)) \le \theta$ and $\Delta_{i,j} < l(V_j)$ then
20                 $d(V_j) \leftarrow \Phi_{i,j} + d(V_i)$;
21                 $l(V_j) \leftarrow \Delta_{i,j}$; $p(V_j) = V_i$;
22                 enqueue (or update) $\langle V_j, (l(V_j), d(V_j)) \rangle$ in $PQ$;
23             end
24         end
25     end
26 end

Detailed Algorithm. As discussed in Section 2.2, balancing
between recreation cost and storage cost is equivalent to balancing
between the minimum spanning tree and the shortest path tree rooted
at $V_0$. Khuller et al. [22] studied the problem of balancing the
minimum spanning tree and the shortest path tree in an undirected graph,
where the resulting spanning tree $T$ has the following properties,
given parameter $\alpha$:

• For each node $V_i$: the cost of the path from $V_0$ to $V_i$ in $T$ is
within $\alpha$ times the shortest path from $V_0$ to $V_i$ in $G$.

• The total cost of $T$ is within $(1 + 2/(\alpha - 1))$ times the cost
of the minimum spanning tree in $G$.

Even though Khuller's algorithm is meant for undirected graphs, it
can be applied to the directed graph case without any comparable
guarantees. The pseudocode is listed in Algorithm 3.

Let MST denote the minimum spanning tree of graph $G$ and
$SP(V_0, V_i)$ denote the shortest path from $V_0$ to $V_i$ in $G$. The
algorithm starts with the MST and then conducts a depth-first traversal
in the MST. Each node $V_i$ keeps track of its path cost from the root as
well as its parent, denoted as $d(V_i)$ and $p(V_i)$ respectively. Given
the approximation parameter $\alpha$, when visiting each node $V_i$, we
first check whether $d(V_i)$ is bigger than $\alpha \times SP(V_0, V_i)$, where
$SP$ stands for shortest path. If yes, we replace the path to $V_i$ with
the shortest path from the root to $V_i$ in $G$ and update $d(V_i)$ as well as
$p(V_i)$. In addition, we keep updating $d(V_i)$ and $p(V_i)$ during the depth
first traversal as stated in lines 4-7 of Algorithm 3.

Algorithm 3: Balance MST and Shortest Path Tree
Input : Graph $G = (V, E)$, MST, SP
Output: Spanning Tree $T = (V_T, E_T)$
 1 Initialize $T$ as MST. Let $d(V_i)$ be the distance from $V_0$ to $V_i$ in $T$ and $p(V_i)$ be the parent of $V_i$ in $T$.
 2 while DFS traversal on MST do
 3     $(V_i, V_j) \leftarrow$ the edge currently in traversal;
 4     if $d(V_j) > d(V_i) + e_{i,j}$ then
 5         $d(V_j) \leftarrow d(V_i) + e_{i,j}$;
 6         $p(V_j) \leftarrow V_i$;
 7     end
 8     if $d(V_j) > \alpha * SP(V_0, V_j)$ then
 9         add shortest path $(V_0, V_j)$ into $T$;
10         $d(V_j) \leftarrow SP(V_0, V_j)$;
11         $p(V_j) \leftarrow V_0$;
12     end
13 end

Example 6. Figure 11(a) is the minimum spanning tree (MST)
rooted at node $V_0$ of $G$ in Figure 9. The approximation threshold $\alpha$
is set to be 2. The algorithm starts with the MST and conducts
a depth-first traversal in the MST from root $V_0$. When visiting
node $V_2$, $d(V_2) = 3$ and the shortest path to node $V_2$ is 3, thus
$3 < 2 \times 3$. We continue to visit nodes $V_1$ and $V_3$. When visiting
$V_3$, $d(V_3) = 8 > 2 \times 3$, where 3 is the shortest path to $V_3$ in
$G$. Thus, $d(V_3)$ is set to be 3 and $p(V_3)$ is set to be node $V_0$ by
replacing the path with the shortest path $(V_0, V_3)$, as shown in Figure 11(b).
Afterwards, the back-edge $\langle V_3, V_1 \rangle$ is traversed in the MST. Since
$3 + 2 < 6$, where 3 is the current value of $d(V_3)$, 2 is the edge
weight of $(V_3, V_1)$ and 6 is the current value of $d(V_1)$, $d(V_1)$
is updated to 5 and $p(V_1)$ is updated to node $V_3$. At last node $V_4$
is visited; $d(V_4)$ is first updated to 7 according to lines 3-7. Since
$7 < 2 \times 4$, lines 9-11 are not executed. Figure 11(c) is the
resulting spanning tree of the algorithm, where the recreation cost
for each node is under the constraint and the total storage cost is
$3 + 3 + 2 + 2 = 10$.


Complexity. The complexity of the algorithm is $O(|E| \log |V|)$.
Given the minimum spanning tree and the shortest path tree rooted at
$V_0$, Algorithm 3 is conducted via a depth first traversal on the MST. It is
easy to show that the complexity of Algorithm 3 is $O(|V|)$. The
time complexity for computing the minimum spanning tree and the
shortest path tree is $O(|E| \log |V|)$ using a heap-based priority queue.
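The traversal can be sketched as follows. This version follows the relaxation-based formulation of Khuller et al., in which path costs start at infinity and the shortest path is added by relaxing each of its edges; it is therefore a close relative of the pseudocode in this section rather than a line-by-line transcription. The input layout (`adj` as a symmetric dict of dicts, `mst_children` as the rooted MST, `sp_dist` and `sp_parent` from a shortest path computation) is an assumption for illustration.

```python
def last_tree(adj, mst_children, sp_dist, sp_parent, alpha, root=0):
    """Balance an MST against a shortest path tree.

    Returns (p, d): the parent map of the resulting tree and the
    root-to-node path cost of every node in it.
    """
    d = {v: float("inf") for v in adj}
    p = {v: None for v in adj}
    d[root] = 0

    def relax(u, v):
        # Adopt u as v's parent if that shortens v's path from the root.
        if d[v] > d[u] + adj[u][v]:
            d[v] = d[u] + adj[u][v]
            p[v] = u

    def add_shortest_path(v):
        # Relax every edge on the shortest path root -> v, top down.
        path = []
        while v != root:
            path.append(v)
            v = sp_parent[v]
        for x in reversed(path):
            relax(sp_parent[x], x)

    def dfs(u):
        if d[u] > alpha * sp_dist[u]:
            add_shortest_path(u)
        for v in mst_children.get(u, []):
            relax(u, v)      # walking down the MST edge
            dfs(v)
            relax(v, u)      # walking back up (the back-edge)

    dfs(root)
    return p, d
```

Using a hypothetical five-node graph whose weights were chosen to be consistent with the numbers quoted in Example 6 (it is not taken from the paper's figure), the call `last_tree(adj, {0: [2], 2: [1], 1: [3, 4]}, sp_dist, sp_parent, 2)` re-attaches node 3 directly to the root, then re-parents node 1 below node 3 on the way back up, and ends with a tree of total weight 10.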

4.4 Git Heuristic 

This heuristic is an adaptation of the current heuristic used by
Git and we refer to it as GitH. We sketch the algorithm here and
refer the reader to Appendix A for our analysis of Git's heuristic.
GitH uses two parameters: $w$ (window size) and $d$ (max depth).

We consider the versions in a non-increasing order of their sizes.
The first version in this ordering is chosen as the root of the
storage graph and has depth 0 (i.e., it is materialized). At all times, we
maintain a sliding window containing at most $w$ versions. For each
version $V_i$ after the first one, let $V_l$ denote a version in the current
window. We compute $\Delta'_{l,i} = \Delta_{l,i} / (d - d_l)$, where $d_l$ is the depth
of $V_l$ (thus deltas with shallow depths are preferred over slightly
smaller deltas with higher depths). We find the version $V_j$ with the
lowest value of this quantity and choose it as $V_i$'s parent (as long as
$d_j < d$). The depth of $V_i$ is then set to $d_j + 1$. The sliding window
is modified to move $V_j$ to the end of the window (so it will stay in
the window longer), $V_i$ is added to the window, and the version at
the beginning of the window is dropped.

Complexity. The running time of the heuristic is $O(|V| \log |V| +
w|V|)$, excluding the time to construct deltas.
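The windowed selection can be sketched as below. The `delta` callback, the function name, and the choice to materialize a version when every window entry is already at maximum depth are assumptions made for illustration; the sketch follows the description in this section, not Git's source code.

```python
def gith(sizes, delta, w, d):
    """Windowed parent selection in the style of the GitH heuristic.

    sizes: list of version sizes, indexed by version id.
    delta: function (l, i) -> size of the delta from version l to version i.
    w: window size (at least 1); d: maximum depth.
    Returns (parent, depth) dicts keyed by version id.
    """
    # Largest version first; it becomes the materialized root.
    order = sorted(range(len(sizes)), key=lambda v: -sizes[v])
    parent = {order[0]: None}
    depth = {order[0]: 0}
    window = [order[0]]
    for i in order[1:]:
        # Only window entries below the depth limit may serve as a base.
        cands = [l for l in window if depth[l] < d]
        if cands:
            j = min(cands, key=lambda l: delta(l, i) / (d - depth[l]))
            parent[i], depth[i] = j, depth[j] + 1
            window.remove(j)      # move the chosen base to the end so
            window.append(j)      # that it stays in the window longer
        else:
            parent[i], depth[i] = None, 0   # assumed fallback: materialize
        window.append(i)
        del window[:-w]           # drop entries from the front beyond w
    return parent, depth
```

With sizes `[100, 90, 85, 60]`, a hypothetical delta equal to the absolute size difference, `w = 2` and `d = 2`, version 2 picks version 1 as its base because the depth-adjusted score 5/1 beats 15/2, and version 3 must also use version 1 because version 2 already sits at the maximum depth.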

5. EXPERIMENTS

We have built a prototype version management system, that will 
serve as a foundation to DataHub O The system provides a 
subset of Git/SVN-like interface for dataset versioning. Users in¬ 
teract with the version management system in a client-server model 
over HTTP. The server is implemented in Java, and is responsible 
for storing the version history of the repository as well as the actual 
files in them. The client is implemented in Python and provides 
functionality to create (commit) and check out versions of datasets, 
and create and merge branches. Note that, unlike traditional VCS 
which make a best effort to perform automatic merges, in our sys¬ 
tem we let the user perform the merge and notify the system by 
creating a version with more than one parent. 

Implementation. In the following sections, we present an extensive
evaluation of our algorithms using a combination of synthetic and
derived real-world datasets. LMG and LAST require both the SPT and
the MST as input, so we implemented those computations in addition
to the algorithms described above. For both directed and undirected
graphs, we use Dijkstra's algorithm to find the single-source shortest
path tree (SPT). We use Prim's algorithm to find the minimum spanning
tree (MST) for undirected graphs. For directed graphs, we use an
implementation of Edmonds' algorithm [38] for computing
the min-cost arborescence (MCA). We ran all our experiments on
a 2.2GHz Intel Xeon CPU E5-2430 server with 64GB of memory,
running 64-bit Red Hat Enterprise Linux 6.5.
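To make the two inputs concrete, the following sketch computes an SPT with Dijkstra's algorithm and an MST with Prim's algorithm on a tiny undirected example. It is an illustration under stated assumptions, not the Java code used in the prototype: vertex 0 is a dummy root whose edge to a version costs that version's full size, the same weights are used for storage and recreation, and the function names are hypothetical.

```python
import heapq


def dijkstra_spt(adj, root):
    """Shortest path tree: returns (parent, dist) for weights in adj."""
    dist, parent = {root: 0}, {root: None}
    heap, done = [(0, root)], set()
    while heap:
        du, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj.get(u, []):
            nd = du + w
            if v not in dist or nd < dist[v]:
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return parent, dist


def prim_mst(adj, root):
    """Minimum spanning tree of an undirected graph: (parent, total weight)."""
    parent, in_tree, total = {}, set(), 0
    heap = [(0, root, None)]
    while heap:
        w, u, p = heapq.heappop(heap)
        if u in in_tree:
            continue
        in_tree.add(u)
        parent[u] = p
        total += w
        for v, wv in adj.get(u, []):
            if v not in in_tree:
                heapq.heappush(heap, (wv, v, u))
    return parent, total


def undirected(edges):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


# Versions 1..3 with sizes 10, 11, 12 and deltas 1-2 (2) and 2-3 (3).
adj = undirected([(0, 1, 10), (0, 2, 11), (0, 3, 12), (1, 2, 2), (2, 3, 3)])
spt_parent, spt_dist = dijkstra_spt(adj, 0)
mst_parent, mst_total = prim_mst(adj, 0)
print(spt_parent, mst_parent, mst_total)
```

In this example the SPT attaches every version directly to the dummy root (all versions materialized), while the MST materializes only version 1 and stores the others as a chain of deltas.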

5.1 Datasets 

We use four data sets: two synthetic and two derived from real- 
world source code repositories. Although there are many publicly 
available source code repositories with large numbers of commits 
(e.g., in GitHub), those repositories typically contain fairly small 
(source code) files, and further the changes between versions tend 
to be localized and are typically very small; we expect dataset ver¬ 
sions generated during collaborative data analysis to contain much 
larger datasets and to exhibit large changes between versions. We 
were unable to find any realistic workloads of that kind. 

Hence, we generated realistic dataset versioning workloads as 
follows. First, we wrote a synthetic version generator suite, driven 
by a small set of parameters, that is able to generate a variety of 
version histories and corresponding datasets. Second, we created 
two real-world datasets using publicly available forks of popular 
repositories on GitHub. We describe each of the two below. 
Synthetic Datasets: Our synthetic dataset generation suite takes a
two-step approach to generate a dataset, which we sketch below.
(Footnote: our synthetic dataset generator may be of independent
interest to researchers working on version management.) The
first step is to generate a version graph with the desired structure,
controlled by the following parameters:

• number of commits, i.e., the total number of versions.

• branch interval and probability, the number of consecutive
versions after which a branch can be created, and the
probability of creating a branch.

• branch limit, the maximum number of branches from
any point in the version history. We choose a number in
[1, branch limit] uniformly at random when we decide
to create branches.

• branch length, the maximum number of commits in any
branch. The actual length is a uniformly chosen integer
between 1 and branch length.

Once a version graph is generated, the second step is to generate
the appropriate versions and compute the deltas. The files in our
synthetic dataset are ordered CSV files (containing tabular data)
and we use deltas based on UNIX-style diffs. The previous step
also annotates each edge (u, v) in the version graph with edit
commands that can be used to produce v from u. Edit commands are a
combination of the following six instructions: add/delete a
set of consecutive rows, add/remove a column, and modify a subset
of rows/columns.

Using this, we generated two synthetic datasets (Figure 12):

• Densely Connected (DC): This dataset is based on a “flat”
version history, i.e., the number of branches is high, they occur
often, and they have short lengths. For each version in this dataset,
we compute the delta with all versions within a 10-hop distance
in the version graph to populate additional entries in Δ and Φ.

• Linear Chain (LC): This dataset is based on a “mostly-linear”
version history, i.e., the number of branches is low, they
occur after large intervals, and they have longer lengths. For each
version in this dataset, we compute the delta with all versions
within a 25-hop distance in the version graph to populate Δ
and Φ.

Real-world datasets: We use 986 forks of the Twitter Bootstrap
repository and 100 forks of the Linux repository to derive our
real-world workloads. For each repository, we check out the latest
version in each fork and concatenate all files in it (by traversing the
directory structure in lexicographic order). Thereafter, we compute
deltas between all pairs of versions in a repository, provided the size
difference between the versions under consideration is less than a
threshold. We set this threshold to 100KB for the Twitter Bootstrap
repository and 10MB for the Linux repository. This gives us two
real-world datasets, Bootstrap Forks (BF) and Linux Forks (LF),
with properties shown in Figure 12.

| Dataset | DC | LC | BF | LF |
|---|---|---|---|---|
| Number of versions | 100010 | 100002 | 986 | 100 |
| Number of deltas | 18086876 | 2916768 | 442492 | 3562 |
| Average version size (MB) | 347.65 | 356.46 | 0.401 | 422.79 |
| MCA-Storage Cost (GB) | 1265.34 | 982.27 | 0.0250 | 2.2402 |
| MCA-Sum Recreation Cost (GB) | 11506437.83 | 29934960.95 | 0.9648 | 47.6046 |
| MCA-Max Recreation Cost (GB) | 257.6 | 717.5 | 0.0063 | 0.5998 |
| SPT-Storage Cost (GB) | 33953.84 | 34811.14 | 0.3854 | 41.2881 |
| SPT-Sum Recreation Cost (GB) | 33953.84 | 34811.14 | 0.3854 | 41.2881 |
| SPT-Max Recreation Cost (GB) | 0.524 | 0.55 | 0.0063 | 0.5091 |

Figure 12: Dataset properties and distribution of delta sizes (each delta size scaled by the average version size in the dataset).
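The first step of the synthetic generator (building a version graph from the four parameters listed above) could look roughly like the following. This is a sketch under our own assumptions, not the generator used for DC and LC: the function name, the rule that only the mainline spawns branches, and resetting the interval counter after each branching decision are all illustrative choices.

```python
import random


def generate_version_graph(num_commits, branch_interval, branch_prob,
                           branch_limit, branch_length, seed=0):
    """Return a list of (parent, child) edges over versions 0..num_commits-1."""
    rng = random.Random(seed)
    edges = []
    next_id = 1          # version 0 is the initial commit
    tip = 0              # current head of the mainline
    since_branch = 0     # mainline commits since the last branching point
    while next_id < num_commits:
        edges.append((tip, next_id))   # extend the mainline by one commit
        tip = next_id
        next_id += 1
        since_branch += 1
        if since_branch >= branch_interval and rng.random() < branch_prob:
            since_branch = 0
            # Number of branches is uniform in [1, branch_limit].
            for _ in range(rng.randint(1, branch_limit)):
                prev = tip
                # Branch length is uniform in [1, branch_length].
                for _ in range(rng.randint(1, branch_length)):
                    if next_id >= num_commits:
                        break
                    edges.append((prev, next_id))
                    prev = next_id
                    next_id += 1
    return edges


# A "flat" history (frequent, short branches) versus a "mostly-linear" one.
flat = generate_version_graph(50, 2, 0.9, 4, 2)
linear = generate_version_graph(50, 20, 0.3, 2, 10)
print(len(flat), len(linear))
```

Small intervals with high probability and short branch lengths give a DC-like shape, while large intervals and long branches give an LC-like shape; the parameter values shown are placeholders, not those used in the experiments.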

5.2 Comparison with SVN and Git 

We begin by evaluating the performance of two popular version
control systems, SVN (v1.8.8) and Git (v1.7.1), using the LF
dataset. We create an FSFS-type repository in SVN, which is more
space-efficient than a Berkeley DB-based repository. We then
import the entire LF dataset into the repository in a single commit.
The amount of space occupied by the db/revs/ directory is
around 8.5GB and it takes around 48 minutes to complete the import.
We contrast this with the naive approach of applying gzip
on the files, which results in total compressed storage of 10.2GB. In
the case of Git, we add and commit the files in the repository and then
run git repack -a -d --depth=50 --window=50 on the repository.
The size of the Git pack file is 202 MB, although the repack
consumes 55GB of memory and takes 114 minutes (for higher window
sizes, Git fails to complete the repack as it runs out of memory).

In comparison, the solution found by the MCA algorithm occupies
516MB of compressed storage (2.24GB when uncompressed)
when using UNIX diff for computing the deltas. To make a fair
comparison with Git, we use xdiff from the LibXDiff library
for computing the deltas, which forms the basis of Git's delta
computing routine. Using xdiff brings down the total storage cost
to just 159 MB. The total time taken is around 102 minutes; this
includes the time taken to compute the deltas and then to find the
MCA for the corresponding graph.

The main reason behind SVN's poor performance is its use of
“skip-deltas” to ensure that at most O(log n) deltas are needed
for reconstructing any version; this tends to make it repeatedly
store redundant delta information, as a result of which the total
space requirement increases significantly. The heuristic used by
Git (Section 4) performs much better than SVN's. However, as we
show later in Section 5.3, our implementation of that heuristic (GitH)
required more storage than LMG for guaranteeing similar recreation costs.
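The skip-delta idea can be illustrated with a few lines of Python. The rule used here (the base of the n-th revision of a file is obtained by clearing the lowest set bit of n) reflects our understanding of Subversion's design notes and should be read as an assumption; the function names are hypothetical.

```python
def skip_delta_base(n):
    """Assumed skip-delta rule: clear the lowest set bit of n."""
    return n & (n - 1)


def reconstruction_chain(n):
    """Revisions visited when rebuilding revision n, ending at revision 0."""
    chain = []
    while n > 0:
        n = skip_delta_base(n)
        chain.append(n)
    return chain


print(reconstruction_chain(54))
longest = max(len(reconstruction_chain(n)) for n in range(1, 1024))
print(longest)
```

The chain length equals the number of set bits in n, so it is bounded by log n. The price is redundancy: under this rule revision 4 is stored as a delta against revision 0, so it re-encodes changes already captured by the deltas for revisions 1 to 3.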

5.3 Experimental Results 

Directed Graphs. We begin with a comprehensive evaluation of 
the three algorithms, LMG, MP, and LAST, on directed datasets. 

Footnote: Unlike git repack, svnadmin pack has a negligible effect
on the storage cost, as it primarily aims to reduce disk seeks and the
per-version disk usage penalty by concatenating files into a single
“pack”.

Figure 13: Results for the directed case, comparing the storage costs and total recreation costs

Figure 14: Results for the directed case, comparing the storage costs and maximum recreation costs

Figure 15: Results for the undirected case, comparing the storage costs and total recreation costs (a-c) or maximum recreation costs (d)


Given that all of these algorithms have parameters that can be used
to trade off the storage cost and the total recreation cost, we compare
them by plotting the different solutions they are able to find
for different values of their respective input parameters.
Figure 13(a-d) shows four such plots; we run each of the algorithms
with a range of values for its input parameter and plot the
storage cost and the total (sum) recreation cost for each of the
solutions found. We also show the minimum possible values for these
two costs: the vertical dashed red line indicates the minimum storage
cost required for storing the versions in the dataset, as found by
MCA, and the horizontal one indicates the minimum total recreation
cost, as found by SPT (equal to the sum of all version sizes).
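Each point in these plots is a storage graph evaluated on the two axes. A minimal sketch of that evaluation follows, assuming a storage graph is given as a parent map and that the cost dictionaries use the key (None, v) for storing or reading v in full; the function name and data layout are illustrative, not the prototype's API.

```python
def evaluate_storage_graph(parent, delta, phi):
    """Return (storage cost, sum of recreation costs, max recreation cost).

    parent[v] is the version v is stored as a delta against, or None if v
    is materialized. delta[(p, v)] and phi[(p, v)] are the storage and
    recreation costs of that choice.
    """
    storage = sum(delta[(p, v)] for v, p in parent.items())
    recreation = {}
    for v in parent:
        # Walk up until we reach the root or a version with a known cost.
        path, u = [], v
        while u is not None and u not in recreation:
            path.append(u)
            u = parent[u]
        cost = recreation[u] if u is not None else 0
        for x in reversed(path):
            cost += phi[(parent[x], x)]
            recreation[x] = cost
    return storage, sum(recreation.values()), max(recreation.values())


# Three versions of sizes 10, 11, 12 with deltas 1->2 (2) and 2->3 (3).
costs = {(None, 1): 10, (None, 2): 11, (None, 3): 12, (1, 2): 2, (2, 3): 3}
chain = evaluate_storage_graph({1: None, 2: 1, 3: 2}, costs, costs)
full = evaluate_storage_graph({1: None, 2: None, 3: None}, costs, costs)
print(chain, full)
```

The delta chain minimizes storage at the price of higher recreation costs, while materializing everything makes the storage cost equal to the sum of recreation costs, which is the pattern visible in the SPT rows of Figure 12.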

The first key observation we make is that the total recreation
cost decreases drastically when we allow a small increase in the
storage budget over MCA. For example, for the DC dataset, the sum
recreation cost for MCA is over 11 PB (see Figure 12), as compared
to just 34TB for the SPT solution (which is the minimum possible).
As we can see from Figure 13(a), a space budget of 1.1×
the MCA storage cost reduces the sum of recreation cost by three
orders of magnitude. Similar trends can be observed for the remaining
datasets and across all the algorithms. We observe that LMG
results in the best tradeoff between the sum of recreation cost and
storage cost, with LAST performing fairly closely. An important
takeaway here, especially given the amount of prior work that
has focused purely on storage cost minimization (Section 6), is
that it is possible to construct balanced trees where the sum
of recreation costs can be reduced and brought close to that of
SPT while using only a fraction of the space that SPT needs.

We also ran the GitH heuristic on all four datasets with varying
window and depth settings. For BF, we ran the algorithm with
four different window sizes (50, 25, 20, 10) for a fixed depth of 10 and
provided the GitH algorithm with all the deltas that it requested.
For all other datasets, we ran GitH with an infinite window size
but restricted it to choose from deltas that were available to the
other algorithms (i.e., only deltas with sizes below a threshold). As
we can see, the solutions found by GitH exhibited very good total
recreation cost, but required significantly higher storage than the
other algorithms. This is not surprising given that GitH is a greedy
heuristic that makes choices in a somewhat arbitrary order.

In Figure 14(a-b), we plot the maximum recreation costs instead
of the sum of recreation costs across all versions for two of
the datasets (the other two datasets exhibited similar behavior). The
MP algorithm found the best solutions here for all datasets, and we
also observed that LMG and LAST both show plateaus for some
datasets where the maximum recreation cost did not change when
the storage budget was increased. This is not surprising given that
the basic MP algorithm tries to optimize for the storage cost given
a bound on the maximum recreation cost, whereas both LMG and
LAST focus on minimization of the storage cost and one version
with high recreation cost is unlikely to affect that significantly.

Undirected Graphs. We test the three algorithms on the undirected
versions of three of the datasets (Figure 15). For DC and
LC, undirected deltas between pairs of versions were obtained by
concatenating the two directional deltas; for the BF dataset, we use
UNIX diff itself to produce undirected deltas. Here again we observe
that LMG consistently outperforms the other algorithms in
terms of finding a good balance between the storage cost and the
sum of recreation costs. MP again shows the best results when trying
to balance the maximum recreation cost and the total storage
cost. Similar results were observed for other datasets but are omitted
due to space limitations.

Figure 16: Taking workload into account leads to better solutions

Workload-aware Sum of Recreation Cost Optimization. In many
cases, we may be able to estimate access frequencies for the various
versions (from historical access patterns), and if available, we
may want to take those into account when constructing the storage
graph. The LMG algorithm can be easily adapted to take such
information into account, whereas it is not clear how to adapt either
LAST or MP in a similar fashion. In this experiment, we use
LMG to compute a storage graph such that the sum of recreation
costs is minimal given a space budget, while taking workload
information into account. The workload here assigns a frequency of
access to each version in the repository using a Zipfian distribution
(with exponent 2); real-world access frequencies are known to
follow such distributions. Given the workload information, the
algorithm should find a storage graph whose (frequency-weighted) sum
of recreation costs is lower than that of the storage graph found when
the workload information is not taken into account (i.e., all versions
are assumed to be accessed equally frequently). Figure 16 shows the
results for this experiment. As we can see, for the DC dataset, taking
into account the access frequencies during optimization led to much
better solutions than ignoring the access frequencies. On the other
hand, for the LF dataset, we did not observe a large difference.
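The workload-aware objective can be written down in a few lines. The sketch below builds Zipfian access frequencies with exponent 2 and computes the frequency-weighted sum of recreation costs; how ranks are assigned to versions is not specified above, so the order of the cost list here is an assumption, as are the function names.

```python
def zipf_frequencies(n, exponent=2.0):
    """Normalized Zipfian frequencies for ranks 1..n."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def weighted_recreation_cost(costs, freqs):
    """Expected recreation cost when version i is accessed with freqs[i]."""
    return sum(c * f for c, f in zip(costs, freqs))


# Recreation costs of three versions, listed from most to least accessed.
costs = [10, 12, 15]
freqs = zipf_frequencies(len(costs))
print(freqs, weighted_recreation_cost(costs, freqs))
```

Because the most frequently accessed version dominates the weighted sum, an optimizer that knows the frequencies has an incentive to keep that version cheap to recreate, even if rarely accessed versions become more expensive.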

Running Times. Here we evaluate the running times of the LMG
algorithm. Recall that LMG takes MST (or MCA) and SPT as inputs.
In Figure 17, we report the total running time as well as the
time taken by LMG itself. We generated a set of version graphs
as subsets of the graphs for the LC and DC datasets as follows: for a
given number of versions n, we randomly choose a node and traverse
the graph starting at that node in a breadth-first manner until we
construct a subgraph with n versions. We generate 5 such subgraphs
for increasing values of n and report the average running
time for LMG; the storage budget for LMG is set to three times
the space required by the MST (all our reported experiments with
LMG use a smaller storage budget than that). The time taken by LMG
on the DC dataset is more than on LC for the same number of versions;
this is because DC has lower delta values than LC (see Figure 12) and
thus requires more edges from SPT to satisfy the storage budget.
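The subgraph sampling step described above can be sketched as a truncated breadth-first traversal. The adjacency-dictionary representation, the function name, and the fixed seed are assumptions made for this example.

```python
import random
from collections import deque


def bfs_sample(adj, n, seed=0):
    """Pick a random start node and collect n versions in BFS order."""
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < n:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
                if len(seen) == n:
                    break
    return seen


# A 10-version linear history, stored as an undirected adjacency map.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
sample = bfs_sample(adj, 4)
print(sorted(sample))
```

The returned set of versions induces a connected subgraph, so the sampled workloads keep the local structure of the original version graph.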

On the other hand, MP takes between 1 and 8 seconds on those
datasets when the recreation cost is set to maximum. Similar to
LMG, LAST requires the MST/MCA and SPT as inputs; however,
the running time of LAST itself is linear and it takes less than 1
second in all cases. Finally, the time taken by GitH on the LC and DC
datasets, for varying window sizes, ranges from 35 seconds (window
= 1000) to a little more than 120 minutes (window = 100000); note
that this excludes the time for constructing the deltas.

Figure 17: Running times of LMG

| Dataset | Quantity | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| v15 | θ (GB) | 0.20 | 0.21 | 0.22 | 0.23 | 0.24 |
| v15 | ILP storage cost (GB) | 0.36 | 0.36 | 0.22 | 0.22 | 0.22 |
| v15 | MP storage cost (GB) | 0.36 | 0.36 | 0.23 | 0.23 | 0.23 |
| v25 | θ (GB) | 0.63 | 0.66 | 0.69 | 0.72 | 0.75 |
| v25 | ILP storage cost (GB) | 2.39 | 1.95 | 1.50 | 1.18 | 1.06 |
| v25 | MP storage cost (GB) | 2.88 | 2.13 | 1.7 | 1.18 | 1.18 |
| v50 | θ (GB) | 0.30 | 0.34 | 0.41 | 0.54 | 0.68 |
| v50 | ILP storage cost (GB) | 1.43 | 1.10 | 0.83 | 0.66 | 0.60 |
| v50 | MP storage cost (GB) | 1.59 | 1.45 | 1.06 | 0.91 | 0.82 |

Table 2: Comparing ILP and MP solutions for small datasets, given
a bound on max recreation cost, θ (in GB)

In summary, although LMG is inherently a more expensive algo¬ 
rithm than MP or LAST, it runs in reasonable time on large input 
sizes; we note that all of these times are likely to be dwarfed by the 
time it takes to construct deltas even for moderately-sized datasets. 

Comparison with ILP solutions. Finally, we compare the quality
of the solutions found by MP with the optimal solution found using
the Gurobi Optimizer for the corresponding problem. We use the ILP
formulation from Section 2.3 with a constraint on the maximum
recreation cost (θ), and compare the optimal storage cost with that
of the MP algorithm (which resulted in solutions with the lowest
maximum recreation costs in our evaluation). We use our synthetic
dataset generation suite to generate three small datasets, with 15, 25
and 50 versions, denoted by v15, v25 and v50 respectively, and compute
deltas between all pairs of versions. Table 2 reports the results of
this experiment across five θ values. The ILP turned out to be very
difficult to solve, even for these very small problem sizes, and in many
cases the optimizer did not finish; in those cases the reported numbers
are the best solutions found by it.

As we can see, the solutions found by MP are quite close to the
ILP solutions for the small problem sizes for which we could get
any solutions out of the optimizer. However, extrapolating from
the (admittedly limited) data points, we expect that on large problem
sizes, MP may be significantly worse than optimal for some
variations of the problems (we note that the optimization problem
formulations involving max recreation cost are likely to turn out
to be harder than the formulations that focus on the average
recreation cost). Development of better heuristics and approximation
algorithms with provable guarantees for the various problems that
we introduce is a rich area for further research.




































6. RELATED WORK 

Perhaps the most closely related prior work is source code version
systems like Git, Mercurial, SVN, and others, which are widely
used for managing source code repositories. Despite their popularity,
these systems largely use fairly simple algorithms underneath
that are optimized to work with modest-sized source code files, and
their on-disk structures are optimized to work with line-based diffs.
These systems are known to have significant limitations when handling
large files and large numbers of versions. As a result, a
variety of extensions like git-annex, git-bigfiles, etc., have
been developed to make them work reasonably well with large files.

There is much prior work in the temporal databases literature
on managing a linear chain of versions, and retrieving
a version as of a specific time point (called snapshot queries).
[15] proposed an archiving technique where all versions of the data
are merged into one hierarchy. An element appearing in multiple
versions is stored only once along with a timestamp. This
technique of storing versions is in contrast with techniques where
retrieval of certain versions may require undoing the changes
(unrolling the deltas). The hierarchical data and the resulting archive
are represented in XML format, which enables the use of XML tools
such as an XML compressor for compressing the archive. It was
not, however, a full-fledged version control system representing an
arbitrary graph of versions; rather it focused on algorithms for
compactly encoding a linear chain of versions.

Snapshot queries have recently also been studied in the context
of array databases and graph databases. Seering
et al. considered the problem of storing an arbitrary tree of
versions in the context of scientific databases; their proposed
techniques are based on finding a minimum spanning tree (as we
discussed earlier, that solution represents one extreme in the spectrum
of solutions that needs to be considered). They also proposed several
heuristics for choosing which versions to materialize given the
distribution of access frequencies to historical versions. Several
databases support “time travel” features (e.g., Oracle Flashback,
Postgres). However, those do not allow for branching trees of
versions. In addition, the schema of tables encoded with Flashback
cannot change. [20] articulates a similar vision to our overall DataHub
vision; however, they do not propose formalisms or algorithms to
solve the underlying data management challenges.

There is also much prior work on compactly encoding differences
between two files or strings in order to reduce communication
or storage costs. In addition to standard utilities like UNIX diff,
many sophisticated techniques have been proposed for computing
differences or edit sequences between two files (e.g., xdelta,
vdelta, vcdiff, zdelta). That work is largely orthogonal
and complementary to our work.

Many prior efforts have looked at the problem of minimizing
the total storage cost for storing a collection of related files (i.e.,
Problem 1). These works do not typically consider the recreation
cost or the tradeoffs between the two. Quinlan et al. propose
an archival “deduplication” storage system that identifies duplicate
blocks across files and only stores them once for reducing storage
requirements. Zhu et al. [40] present several optimizations on the
basic theme. Douglis et al. [19] present several techniques to identify
pairs of files that could be efficiently stored using delta compression
even if there is no explicit derivation information known
about the two files; similar techniques could be used to better identify
which entries of the matrices Δ and Φ to reveal in our scenario.
Ouyang et al. studied the problem of compressing a
large collection of related files by performing a sequence of pairwise
delta compressions. They proposed a suite of text clustering
techniques to prune the graph of all pairwise delta encodings
and find the optimal branching (i.e., MCA) that minimizes the total
weight. Burns and Long present a technique for in-place
reconstruction of delta-compressed files using a graph-theoretic
approach. That work could be incorporated into our overall framework
to reduce the memory requirements during reconstruction.
Similar dictionary-based reference encoding techniques have been
used to efficiently represent a target web page in terms of
additions/modifications to a small number of reference web pages.
Kulkarni et al. present a more general technique that combines
several different techniques to identify similar blocks among a
collection of files, and uses delta compression to reduce the total storage
cost (ignoring the recreation costs). We refer the reader to a recent
survey for a more comprehensive coverage of this line of
work.

7. CONCLUSIONS AND FUTURE WORK

Large datasets and collaborative and iterative analysis are be¬ 
coming a norm in many application domains; however we lack the 
data management infrastructure to efficiently manage such datasets, 
their versions over time, and derived data products. Given the high 
overlap and duplication among the datasets, it is attractive to con¬ 
sider using delta compression to store the datasets in a compact 
manner, where some datasets or versions are stored as modifica¬ 
tions from other datasets; such delta compression however leads 
to higher latencies while retrieving specific datasets. In this paper, 
we studied the trade-off between the storage and recreation costs 
in a principled manner, by formulating several optimization prob¬ 
lems that trade off these two in different ways and showing that 
most variations are NP-Hard. We also presented several efficient 
algorithms that are effective at exploring this trade-off, and we pre¬ 
sented an extensive experimental evaluation using a prototype ver¬ 
sion management system that we have built. There are many in¬ 
teresting and rich avenues for future work that we are planning to 
pursue. In particular, we plan to develop online algorithms for mak¬ 
ing the optimization decisions as new datasets or versions are being 
created, and also adaptive algorithms that reevaluate the optimiza¬ 
tion decisions based on changing workload information. We also 
plan to explore the challenges in extending our work to a distributed 
and decentralized setting.

APPENDIX

A. GIT REPACK 

Git uses delta compression to reduce the amount of storage required
to store a large number of files (objects) that contain duplicated
information. However, git's algorithm for doing so is not
clearly described anywhere. An old discussion with Linus has a
sketch of the algorithm. However, there have been several
changes to the heuristics used that don't appear to be documented
anywhere.

The following describes our understanding of the algorithm based
on the latest git source code.

Here we focus on “repack”, where the decisions are made for a
large group of objects. However, the same algorithm appears to be
used for normal commits as well. Most of the algorithm code is in
the file builtin/pack-objects.c.

Step 1: Sort the objects, first by “type”, then by “name hash”, and
then by “size” (in decreasing order). The comparator is (line
1503):

static int type_size_sort(const void *_a, const void *_b)

Note that the name hash is not a true hash; the pack_name_hash()
function (pack-objects.h) simply creates a number from the last
16 non-whitespace characters, with the last characters counting the
most (so all files with the same suffix, e.g., .c, will sort together).
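The effect of this name hash on the sort order can be reproduced with a short Python transliteration. The shift-and-add update below follows our reading of pack_name_hash() and should be treated as an assumption rather than a quotation of git's source; the 32-bit mask emulates unsigned overflow in C.

```python
def pack_name_hash(name: bytes) -> int:
    """Assumed transliteration: later characters land in the high bits."""
    h = 0
    for c in name:
        if c in b" \t\n\r\x0b\x0c":   # skip whitespace characters
            continue
        # Each new character pushes earlier ones 2 bits to the right, so
        # only the last 16 characters can influence the 32-bit result.
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


names = [b"a.c", b"x.h", b"b.c", b"y.h"]
ordered = sorted(names, key=pack_name_hash)
print([hex(pack_name_hash(n)) for n in names], ordered)
```

Files sharing a suffix differ only in low-order bits of the result, so sorting by this value places them next to each other, which is what makes them likely delta candidates within the same window.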

Step 2: The next key function is ll_find_deltas(), which goes
over the files in the sorted order. It maintains a list of W objects
(W = window size, default 10) at all times. For the next
object, say O, it finds the delta between O and each of the objects,
say B, in the window; it chooses the object with the minimum
value of delta(B, O) / (max_depth - depth of B),
where max_depth is a parameter (default 50), and depth of B refers
to the length of the delta chain between a root and B.

The original algorithm appears to have only used delta(B, O)
to make the decision, but the “depth bias” (denominator) was added
at a later point to prefer slightly larger deltas with smaller delta
chains. The key lines for the above part:

• line 1812 (check each object in the window):

ret = try_delta(n, m, max_depth, &mem_usage);

• lines 1617-1618 (depth bias):

max_size = (uint64_t)max_size * (max_depth - src->depth) /
           (max_depth - ref_depth + 1);

• line 1678 (compute delta and compare size):

delta_buf = create_delta(src->index, trg->data, trg_size,
                         &delta_size, max_size);

Footnote: the git source was cloned from https://github.com/git/git on 5/11/2015,
commit id: 8440f74997cf7958c7e8ec853f590828085049b8

create_delta() returns non-null only if the new delta being
tried is smaller than the current delta (modulo depth bias);
specifically, only if the size of the new delta is less than the max_size
argument. Note: lines 1682-1688 appear redundant given the depth
bias calculations.
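A small numeric illustration of the depth-bias expression quoted above (lines 1617-1618) follows. It evaluates only that one formula with integer division, as in C; the function name and the example values are ours, and the interpretation of ref_depth as the depth associated with the target's current delta is an assumption.

```python
def biased_max_size(max_size, max_depth, src_depth, ref_depth):
    """Size limit a new delta must beat, scaled by the candidate's depth."""
    return max_size * (max_depth - src_depth) // (max_depth - ref_depth + 1)


# Current best delta is 1000 bytes; max_depth is the default of 50.
shallow = biased_max_size(1000, 50, src_depth=0, ref_depth=1)
deep = biased_max_size(1000, 50, src_depth=40, ref_depth=1)
print(shallow, deep)
```

A candidate base at depth 0 only has to produce a delta smaller than the current one, whereas a base at depth 40 has to beat a much tighter limit, which is how deeper chains are penalized.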

Step 3. Originally the window was just the last W objects before
the object O under consideration. However, the current algorithm
shuffles the objects in the window based on the choices
made. Specifically, let $b_1, \ldots, b_W$ be the current objects in the
window. Let the object chosen to delta against for O be $b_i$. Then
$b_i$ would be moved to the end of the list, so the new list would be:

$[b_1, b_2, \ldots, b_{i-1}, b_{i+1}, \ldots, b_W, O, b_i]$. Then when we move to the

new object after O (say O'), we slide the window and so the new
window then would be: $[b_2, \ldots, b_{i-1}, b_{i+1}, \ldots, b_W, O, b_i, O']$. Small
detail: the list is actually maintained as a circular buffer so the list
doesn't have to be physically “shifted” (moving $b_i$ to the end does
involve a shift though). Relevant code here is lines 1854-1861.

Finally we note that git never considers/computes/stores a delta 
between two objects of different types, and it does the above in 
a multi-threaded fashion, by partitioning the work among a given 
number of threads. Each of the threads operates independently of 
the others. 





